ENTROPIC INEQUALITIES AND MARGINAL PROBLEMS 



TOBIAS FRITZ AND RAFAEL CHAVES 

Abstract. A marginal problem asks whether a given family of marginal distributions for some 
set of random variables arises from some joint distribution of these variables. Here we point 
out that the existence of such a joint distribution imposes non-trivial conditions already on the 
level of Shannon entropies of the given marginals. These entropic inequalities are necessary (but 
not sufficient) criteria for the existence of a joint distribution. For every marginal problem, a 
list of such Shannon-type entropic inequalities can be calculated by Fourier-Motzkin elimination, 
and we offer a software interface to a Fourier-Motzkin solver for doing so. For the case that the 
hypergraph of given marginals is a cycle graph, we provide a complete analytic solution to the 
problem of classifying all relevant entropic inequalities, and use this result to bound the decay of 
correlations in stochastic processes. Furthermore, we show that Shannon-type inequalities for dif- 
ferential entropies are not relevant for continuous-variable marginal problems; non-Shannon-type 
inequalities are, both in the discrete and in the continuous case. In contrast to other approaches, 
our general framework easily adapts to situations where one has additional (conditional) inde- 
pendence requirements on the joint distribution, as in the case of graphical models. We end with 
a list of open problems. 

A complementary article discusses applications to quantum nonlocality and contextuality. 



1. Introduction 

This work concerns two lines of research which we would like to introduce separately and relate 
to each other afterwards. 

Marginal problems. Imagine you have three coins $A_1$, $A_2$ and $A_3$. However, for some physical 
reason, you can only flip two of them at a time. Upon flipping $A_1$ and $A_2$ together, you find that 
these two coins always give the same outcome: two heads occur with a relative frequency of 1/2 and 
two tails occur with a relative frequency of 1/2. Upon flipping $A_2$ and $A_3$ together, the same behavior 
ensues. However, upon flipping $A_1$ and $A_3$ together, you find the exactly opposite result, so that 
the two outcomes are never identical: the two combinations of one head and one tail occur with 
probability 1/2 each. 

Now what will happen when you only flip a single coin? Clearly, all three pairwise combinations 
are consistent in the sense that they predict each coin to yield heads and tails with relative frequency 
1/2 each. Therefore, this has to be the resulting outcome distribution of flipping only one coin by 
itself. 
But what would happen if you were able to flip all three coins at once? Since $A_1$ and $A_2$ 
are perfectly correlated, and also $A_2$ and $A_3$ are perfectly correlated, it follows that $A_1$ and $A_3$ 
should also be perfectly correlated. This contradicts the observation that $A_1$ and $A_3$ are perfectly 
anticorrelated. Therefore, no three-variable joint distribution compatible with the given marginals 
exists, due to the transitivity of perfect correlation. Although the two-coin outcome distributions 
give consistent single-coin distributions, no three-coin distribution is compatible with the given 
data! Hence either there is some systematic error in the coin flips, or the coins are operated by 
some weird mechanism creating the required distribution as a function depending on which two are 
flipped together. 

This is the simplest non-trivial example of a marginal problem: given a list of joint distributions 
of certain subsets of random variables $A_1, \ldots, A_n$, is it possible to find a joint distribution for all 
these variables, such that this distribution marginalizes to the given ones? One obvious necessary 
condition is that for any two of the given distributions which can be marginalized to the same 
subset of variables, the resulting marginals should be the same. In the "three coins" example, we 
found this to be the case, since the single-coin marginals were unambiguous: each coin by itself is 
unbiased, and this does not depend on which other coin it is tossed together with. The example 
also shows that this consistency condition is not sufficient to guarantee the existence of a joint 
distribution. 

Marginal problems naturally arise in several different fields. To us, the most familiar one is 
"quantum nonlocality" [8, 23], which features close relatives of our unextendability example — 
with four "coins" instead of three and with slightly different given marginal two-coin distributions. 
In this case, the unextendability has actually been observed experimentally [6], bearing witness 
to the counterintuitive behavior of quantum theory. As part of the endeavor to understand the 
counterintuitive features of quantum theory, marginal problems have become an active field of 
research within the foundations of quantum mechanics [2, 3, 11, 38]. Unfortunately, references from 
this field to the existing mathematical literature on the subject are virtually nonexistent. It has 
been noticed before [32, Sec. 2.2.1.1] that this constitutes a "disturbing example of a split between 
mathematics and physics" . One of our goals is to ameliorate this situation a bit by pointing out 
some of the literature on both sides. 

Marginal problems have also arisen in the following other fields of mathematical research: 

(1) knowledge integration of expert systems in artificial intelligence [56], 

(2) database theory and privacy aspects of databases [1, 17, 21], 

(3) Vorob'ev's theory of coalition games [58]. 

We will present a more detailed exposition of how marginal problems arise in these contexts in 
section 2. 

The origin of this subject can be traced back to at least 1955, when Bass [7] considered 
the case of three continuous variables with given two-variable marginals. Other early works 
include [20], [31] and [57]. A more abstract and general formulation in terms of $\sigma$-algebras can be 
found e.g. in [27]. Our "three coins" example appears in most papers treating marginal prob- 
lems [2, 57, ...], sometimes more prosaically phrased [38, 48]. Some further randomly selected 
references studying marginal problems are [5, 40, 52]. Also various quantum versions of marginal 
problems have been considered, see e.g. [33]. 

Entropic inequalities. The most central concept in information theory is that of Shannon entropy 
and its siblings like conditional entropy, mutual information and relative entropy. Its importance 
manifests itself not only in the widespread use of Shannon entropy within information theory 
itself [12, 62], but also in related fields like biodiversity studies [36], Bayesian statistics [28], research 
on collective social behavior [47], or additive combinatorics [54]. The proofs of theorems in in- 
formation theory often rely on inequalities between Shannon entropies and/or derived quantities. 
Therefore, it is of fundamental importance to understand all the inequalities which hold between 
entropies of certain collections of random variables. The so-called Shannon-type inequalities [62, 
Ch. 13] are the most frequently used kind of entropic inequalities. This is the class of all those 
linear inequalities which can be derived from the basic inequalities 

\[
\begin{aligned}
H(X) &\geq 0, \qquad H(X|Y) = H(XY) - H(Y) \geq 0, \\
I(X : Y) &= H(X) + H(Y) - H(XY) \geq 0, \\
I(X : Y|Z) &= H(XZ) + H(YZ) - H(XYZ) - H(Z) \geq 0,
\end{aligned}
\]

where each symbol $X$, $Y$, $Z$ stands for a random variable or collection of random variables. These 
basic inequalities express non-negativity of Shannon entropy $H(X)$, conditional entropy $H(X|Y)$, 
mutual information $I(X : Y)$ and conditional mutual information $I(X : Y|Z)$. Many commonly 
used information-theoretic inequalities are Shannon-type inequalities; see e.g. [39] and references 
therein for a rather general class of such inequalities and their applications. 
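The four quantities entering the basic inequalities can be evaluated numerically for any finite joint distribution. The following is a minimal sketch, not part of the original text: the helper names (`marginal`, `entropy`, `cond_entropy`, `mutual_info`, `cond_mutual_info`) and the XOR example distribution are illustrative assumptions, and variables are addressed by their position in the outcome tuple.

```python
from collections import defaultdict
from math import log2

def marginal(joint, keep):
    """Marginalize a joint distribution {outcome tuple: prob} onto the positions in `keep`."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in keep)] += p
    return dict(out)

def entropy(joint, keep):
    """Shannon entropy in bits of the marginal on the positions in `keep`."""
    return -sum(p * log2(p) for p in marginal(joint, keep).values() if p > 0)

def cond_entropy(joint, x, y):
    """H(X|Y) = H(XY) - H(Y)."""
    return entropy(joint, x + y) - entropy(joint, y)

def mutual_info(joint, x, y):
    """I(X:Y) = H(X) + H(Y) - H(XY)."""
    return entropy(joint, x) + entropy(joint, y) - entropy(joint, x + y)

def cond_mutual_info(joint, x, y, z):
    """I(X:Y|Z) = H(XZ) + H(YZ) - H(XYZ) - H(Z)."""
    return (entropy(joint, x + z) + entropy(joint, y + z)
            - entropy(joint, x + y + z) - entropy(joint, z))

# Hypothetical example: X and Y are independent fair bits and Z = X XOR Y.
xor_joint = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}
X, Y, Z = (0,), (1,), (2,)  # tuples of positions, so that X + Y means "X and Y jointly"
```

In this example all four basic quantities are non-negative, as they must be; the mutual information between `X` and `Y` vanishes, while conditioning on `Z` makes them fully dependent.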

A linear programming framework for Shannon-type entropic inequalities was introduced 
in [61], including the software package ITIP which determines whether a given linear entropic 
inequality is a valid Shannon-type inequality or not. Further progress was made in [64], where 
it was shown that not all valid linear inequalities among entropic quantities are Shannon-type 
inequalities. 

Occurrences of entropic inequalities outside of information theory itself include applications to 
group theory [14] and to Kolmogorov complexity [25]. In this paper, we consider an application of 
entropic inequalities which was originally introduced, in a less general context, by Braunstein and 
Caves in [9]. In our terminology and notation, they were working with the Shannon-type entropic 
inequality 

\[
H(A_1 A_4) + H(A_2) + H(A_3) \leq H(A_1 A_2) + H(A_2 A_3) + H(A_3 A_4),
\]

which is valid for all joint distributions of the four variables. They found that this inequality can 
be violated by using a "four coin" example similar to the one above: the relevant joint distributions 
of variable pairs are known, so that their entropies are well-defined, and the inequality can be eval- 
uated. The resulting violation witnesses that there cannot exist any joint distribution compatible 
with the given two-variable marginals. In this sense, entropic inequalities give necessary conditions 
for the existence of solutions to marginal problems. See proposition 5.2. 
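To illustrate how such an inequality is evaluated on pairwise data alone, the sketch below checks the four-variable inequality $H(A_1 A_4) + H(A_2) + H(A_3) \leq H(A_1 A_2) + H(A_2 A_3) + H(A_3 A_4)$ on a hypothetical "four coin" marginal model. The model is an assumption chosen for simplicity (three perfectly correlated pairs and one independent pair); it is not the quantum example of Braunstein and Caves.

```python
from math import log2

def H(dist):
    """Shannon entropy in bits of a distribution {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

# Hypothetical two-coin distributions (0 = heads, 1 = tails).
correlated = {(0, 0): 0.5, (1, 1): 0.5}
independent = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
fair_coin = {(0,): 0.5, (1,): 0.5}  # single-coin marginal implied by every pair above

pair_marginals = {
    (1, 2): correlated,
    (2, 3): correlated,
    (3, 4): correlated,
    (1, 4): independent,
}

lhs = H(pair_marginals[(1, 4)]) + H(fair_coin) + H(fair_coin)
rhs = sum(H(pair_marginals[edge]) for edge in [(1, 2), (2, 3), (3, 4)])
violated = lhs > rhs  # a violation rules out any joint distribution of all four coins
```

Only the entropies of the given marginals enter the computation, so the same check applies unchanged to variables with any number of outcomes.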

An important advantage of the application of entropic inequalities to marginal problems is that 
they apply irrespective of the number of outcomes of each variable. On the negative side, entropic 
inequalities are only a necessary criterion for the existence of a solution to a marginal problem: 
many marginal problems have no solution, although no violations of corresponding Shannon-type 
entropic inequalities exist. For a more detailed discussion of these issues, we refer to our companion 
paper [16]. 

Our contributions and structure of this paper. We start in section 2 by introducing marginal 
problems in more detail in order to set up terminology and notation. The "three coins" reappear 
as example 2.4. 
In section 3, we use polymatroids as a rigorous formalism for the discussion of Shannon-type 
inequalities and introduce partial polymatroids, which represent marginal problems on the level of 
entropies. 

Section 4 explains how to use Fourier-Motzkin elimination to compute all Shannon-type entropic 
inequalities for a given marginal scenario, while possibly taking into account additional 
(conditional) independence requirements on the joint distribution, as in graphical models. We offer a 
MATHEMATICA [60] package which generates, given the marginal scenario, the corresponding input 
for the Fourier-Motzkin solver PORTA [18]. Although these computations are very demanding, we 
have found them to be of some use in our work on quantum nonlocality [16]. We also outline an 
application to causal inference along the lines of [51]. 

We analytically solve the partial polymatroid version of marginal problems for the family of 
$n$-cycle marginal scenarios $C_n$ with $n \in \mathbb{N}$, in section 5. The resulting inequalities form a single 
equivalence class under the action of the cyclic symmetry of $C_n$. Our proof implies that there are no 
non-Shannon-type inequalities in any $C_n$. Finally, we use these $n$-cycle inequalities to give a bound 
on the decay of correlations in stationary stochastic processes. 

Shannon-type entropic inequalities for differential entropy are discussed in section 6, where we 
show them to not give any non-trivial constraints on the existence of solutions to marginal problems 
for continuous variables. 

Section 7 shows that non-Shannon-type entropic inequalities can be useful for detecting the 
non-existence of solutions to marginal problems, both for discrete and for continuous variables. 

Finally, we conclude in section 8 with a list of open problems. 

Notation and conventions. All our logarithms are with respect to base 2. In particular, we 
measure entropy in bits. We take $[n] = \{1, \ldots, n\}$ to be a finite index set and write $2^{[n]}$ for the set 
of all subsets of [n]. With the exception of sections 6 and 7, all random variables occurring in this 
paper are assumed to be discrete in such a way that their Shannon entropy converges. 

2. Marginal problems for random variables 

In this section, we introduce marginal problems as discussed, in different variants, for example 
in [2, 7, 20, 27, 29, 31, 38, 40, 46, 52, 57]. 

We consider a finite number of random variables $A_1, \ldots, A_n$. For any subset $S \subseteq [n]$, we also 
write $A_S$ for the tuple $(A_i)_{i \in S}$. In particular, $A_{[n]} = (A_1, \ldots, A_n)$ represents the joint distribution 
of all variables, and we stipulate $A_\emptyset = \emptyset$. 

In many situations, one knows the distribution of $A_S$ for certain subsets $S \subseteq [n]$, but not the 
joint distribution of $A_{[n]}$. Sometimes, it is unclear whether a joint distribution even exists; in this 
case, one deals with a marginal problem. 

Now if the distribution of $A_S$ is known for some $S \subseteq [n]$, then taking marginals down to a 
smaller subset $S' \subseteq S$ yields the distribution of $A_{S'}$. Therefore, the collection of sets of variables 
with known distribution is naturally closed under taking subsets. This motivates the following 
definition: 

Definition 2.1 ([57]). A marginal scenario $\mathcal{M}$ on $[n]$ is a non-empty collection $\mathcal{M} = \{S_1, \ldots, S_{|\mathcal{M}|}\}$ 
of subsets $S_i \subseteq [n]$ such that if $S \in \mathcal{M}$ and $S' \subseteq S$, then also $S' \in \mathcal{M}$. 

In its topological interpretation, such a combinatorial structure is also known as an abstract 
simplicial complex [42]. 

Clearly, a marginal scenario is determined by those subsets $S_i \in \mathcal{M}$ which are not contained in 
any other $S_j \in \mathcal{M}$; such a subset is maximal. In particular, it is sufficient to specify these maximal 
subsets when defining a particular marginal scenario. This is the approach taken in [2, Sec. 2.4], 
where these maximal subsets are called measurement contexts. For any set system $X \subseteq 2^{[n]}$, we let 
X~ denote the set system containing X together with all the subsets of sets in X. It is the marginal 
scenario generated by X. 

We now formalize the idea of specifying a family of compatible marginal distributions for a 
marginal scenario $\mathcal{M}$. If $P$ is a probability distribution on some set of variables containing $A_S$ with $S \subseteq [n]$, 
then we write $P_{|S}$ for the marginal distribution associated to $A_S$. 

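Definition 2.2. A marginal model on $\mathcal{M}$ is a collection $(P^{\mathcal{M}}_S)_{S \in \mathcal{M}}$ of probability distributions 
$P^{\mathcal{M}}_S$ for the variables $A_S$ such that these distributions are compatible: for any pair $S, T \in \mathcal{M}$ with 
$T \subseteq S$, taking the marginal $P^{\mathcal{M}}_{S|T}$ of the distribution $P^{\mathcal{M}}_S$ over those variables not contained in $T$ 
yields precisely the given distribution $P^{\mathcal{M}}_T$, 

\[
P^{\mathcal{M}}_{S|T} = P^{\mathcal{M}}_T. \tag{1}
\]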

In particular, this compatibility condition implies that for any triple of subsets $S, S', T \in \mathcal{M}$ 
with $T \subseteq S, S'$, we have $P^{\mathcal{M}}_{S|T} = P^{\mathcal{M}}_{S'|T}$, as in [2]. 
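When a marginal model is specified only through its maximal subsets, the consequence just stated is what has to be verified in practice: any two given distributions must agree on their shared variables. A minimal sketch of such a check follows; the function names and the dictionary-based representation are illustrative assumptions, not notation from the text.

```python
from collections import defaultdict
from itertools import combinations

def marginalize(dist, variables, keep):
    """dist: {outcome tuple: prob} over `variables` (a tuple of labels).
    Returns the marginal on the labels in `keep`."""
    idx = [variables.index(v) for v in keep]
    out = defaultdict(float)
    for outcome, p in dist.items():
        out[tuple(outcome[i] for i in idx)] += p
    return dict(out)

def is_compatible(model, tol=1e-12):
    """model: {tuple of variable labels: distribution on those variables}.
    Checks that any two given distributions have the same marginal on their shared variables."""
    for (s, ps), (t, pt) in combinations(model.items(), 2):
        shared = tuple(v for v in s if v in t)
        ms, mt = marginalize(ps, s, shared), marginalize(pt, t, shared)
        if any(abs(ms.get(k, 0.0) - mt.get(k, 0.0)) > tol for k in set(ms) | set(mt)):
            return False
    return True

# The "three coins" data from the introduction (0 = heads, 1 = tails).
same = {(0, 0): 0.5, (1, 1): 0.5}
opposite = {(0, 1): 0.5, (1, 0): 0.5}
three_coins = {(1, 2): same, (2, 3): same, (1, 3): opposite}
```

Here `is_compatible(three_coins)` holds, since every coin is unbiased in each pair it appears in. Compatibility in this sense does not by itself guarantee that a joint distribution exists.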

The prime example of a marginal model $P^{\mathcal{M}}$ arises when starting from a joint distribution $P$ 
and defining the marginal model in terms of its marginals as $P^{\mathcal{M}}_S = P_{|S}$. However, we will see 
that not all marginal models can be constructed in this way. The following terminology follows the 
literature on quantum contextuality, e.g. [38, Thm. 6]. 

Definition 2.3. $P^{\mathcal{M}}$ is non-contextual if there exists a joint distribution $P = P(a_1, \ldots, a_n)$ for all 
variables $A_1, \ldots, A_n$ such that its marginals coincide with the distributions occurring in the marginal 
model, i.e. if $P^{\mathcal{M}}_S = P_{|S}$ for all $S \in \mathcal{M}$. Otherwise, $P^{\mathcal{M}}$ is contextual. 

The idea behind the term "contextual" is that although a contextual marginal model allows no 
joint distribution for all variables in the conventional sense, one can easily find compatible joint 
distributions which depend on the subset of variables $S \subseteq [n]$. If one does this, then the joint 
distribution depends on the context in which it is probed. 

Under certain assumptions on $\mathcal{M}$, every marginal model is non-contextual [57]. In general, this 
is not so, with the most elementary example being the "triangle" : 

Example 2.4 ([38, 57, ...]). We now formalize the "three coins" example from the introduction 
in this language. The corresponding marginal scenario is denoted by $C_3$ and consists of three 
variables $A_1, A_2, A_3$ where the three pairwise marginals are assumed to be given, but not the full 
joint distribution, so that 

\[
C_3 = \{\{1,2\},\{1,3\},\{2,3\}\}^{\sim}.
\]

The three variables take values in the set {heads, tails}, such that each single variable separately 
has a uniformly random outcome. The two-variable distributions are 

\[
\begin{aligned}
P^{C_3}_{\{1,2\}}(A_1 = a_1, A_2 = a_2) &= \begin{cases} 1/2 & \text{if } a_1 = a_2 \\ 0 & \text{if } a_1 \neq a_2 \end{cases} \\
P^{C_3}_{\{2,3\}}(A_2 = a_2, A_3 = a_3) &= \begin{cases} 1/2 & \text{if } a_2 = a_3 \\ 0 & \text{if } a_2 \neq a_3 \end{cases} \\
P^{C_3}_{\{1,3\}}(A_1 = a_1, A_3 = a_3) &= \begin{cases} 0 & \text{if } a_1 = a_3 \\ 1/2 & \text{if } a_1 \neq a_3 \end{cases}
\end{aligned}
\tag{2}
\]

These determine the single-variable distributions $P^{C_3}_{\{1\}}$, $P^{C_3}_{\{2\}}$, $P^{C_3}_{\{3\}}$ in a consistent way, thereby 
satisfying the premise of definition 2.2. We claim that this marginal model is contextual. To see 
this, let $P$ be a hypothetical joint distribution. Then any joint outcome probability 

\[
P(A_1 = a_1, A_2 = a_2, A_3 = a_3)
\]

needs to vanish: if $a_1 \neq a_2$, this follows from the requirement $P_{|\{1,2\}} = P^{C_3}_{\{1,2\}}$; if $a_2 \neq a_3$, it follows 
from $P_{|\{2,3\}} = P^{C_3}_{\{2,3\}}$; the remaining case is $a_1 = a_2 = a_3$, and then it is implied by $P_{|\{1,3\}} = P^{C_3}_{\{1,3\}}$. 
Since the probabilities of all joint outcomes must sum to one, no such $P$ exists. 
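The case distinction above can be replayed mechanically: a joint outcome can carry positive probability only if each of its two-coin restrictions lies in the support of the corresponding given marginal. The sketch below enumerates the joint outcomes that survive this test; the names `allowed_outcomes` and `three_coins` are illustrative, and positions 0, 1, 2 stand for $A_1, A_2, A_3$.

```python
from itertools import product

# Two-coin distributions of the "three coins" model (0 = heads, 1 = tails).
same = {(0, 0): 0.5, (1, 1): 0.5}
opposite = {(0, 1): 0.5, (1, 0): 0.5}
three_coins = {(0, 1): same, (1, 2): same, (0, 2): opposite}

def allowed_outcomes(model, n):
    """Joint outcomes of n binary variables that no given marginal forces to probability zero."""
    return [a for a in product((0, 1), repeat=n)
            if all(dist.get(tuple(a[i] for i in s), 0.0) > 0 for s, dist in model.items())]

surviving = allowed_outcomes(three_coins, 3)  # empty list: every joint outcome must have probability 0
```

An empty list means that no joint distribution exists, since the joint probabilities could not sum to one. A non-empty list would only show that this particular support argument is inconclusive; it would not prove that a joint distribution exists.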

This example demonstrates the existence of contextual marginal models. The question now is 
the following: 

Problem 2.5 (Marginal Problem). How to decide whether a given marginal model in a given 
marginal scenario $\mathcal{M}$ is non-contextual or contextual? 

Remark 2.6. When the number of outcomes of each variable is finite, this is a linear programming 
problem: the joint distribution can be identified with its list of outcome probabilities, which are 
nonnegative real numbers subject to a list of equations (reproduction of the given marginals). See [2] 
for an explicit formulation of this linear program. However, the number of variables in this linear 
program is in general exponential in $n$; when the number of outcomes of each variable is $d$, then it 
is $d^n$, corresponding to the size of a joint distribution. In fact, certain classes of marginal problems 
are known to be NP-complete [45]. 
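To make the size of this linear program concrete, the following sketch builds its unknowns and equality constraints without solving it. The function name `marginal_lp` and the data layout are assumptions made for illustration; the resulting system, together with non-negativity of all unknowns, could be handed to any linear programming solver.

```python
from itertools import product

def marginal_lp(model, n, d):
    """Set up the feasibility problem of Remark 2.6 for n variables with d outcomes each.
    model: {tuple of variable positions: {outcome tuple: prob}}.
    Returns (unknowns, equations): the unknowns are the joint outcomes; each equation is a
    pair (indices of unknowns whose probabilities are summed, required value of that sum)."""
    unknowns = list(product(range(d), repeat=n))
    equations = []
    for s, dist in model.items():
        for values in product(range(d), repeat=len(s)):
            idx = [k for k, a in enumerate(unknowns) if tuple(a[i] for i in s) == values]
            equations.append((idx, dist.get(values, 0.0)))
    return unknowns, equations

# "Three coins" data, with positions 0, 1, 2 standing for the three coins.
same = {(0, 0): 0.5, (1, 1): 0.5}
opposite = {(0, 1): 0.5, (1, 0): 0.5}
unknowns, equations = marginal_lp({(0, 1): same, (1, 2): same, (0, 2): opposite}, n=3, d=2)
```

For three binary variables this gives 8 unknowns and 12 equations; the number of unknowns grows as $d^n$, which is what makes the direct approach expensive for larger scenarios.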

The entropic inequalities we are going to study in the following sections are necessary conditions 
for a marginal model to be non-contextual. 

We end this section by discussing how marginal problems, and intimately related issues, arise in 
various mathematical sciences. This list is certainly not complete, but merely represents our own 
limited knowledge. 

(1) In quantum theory, a physical system is described by a Hilbert space $\mathcal{H}$, to which one 
associates the C*-algebra $B(\mathcal{H})$ of bounded operators. A state is a unit vector $\psi \in \mathcal{H}$, 
while an observable is a hermitian operator $A = A^* \in B(\mathcal{H})$. If $A = \sum_i \lambda_i Q_i$ is the spectral 
decomposition¹ of $A$, then the Born rule states that the outcome distribution associated to a 
measurement of $A$ is given by $P(A = \lambda_i) = \langle \psi, Q_i \psi \rangle$. If two or more observables $A_1, \ldots, A_n$ 
are hermitian operators which commute pairwise, and have spectral decompositions $A_j = 
\sum_i \lambda_{j,i} Q_{j,i}$, then they are jointly measurable, and their joint distribution is given by 

\[
P(A_1 = \lambda_{1,i_1}, \ldots, A_n = \lambda_{n,i_n}) = \langle \psi, Q_{1,i_1} \cdots Q_{n,i_n} \psi \rangle.
\]

However, if the observables are not pairwise commuting, then they cannot be jointly measured, 
and their joint outcome distribution is undefined. A marginal scenario $\mathcal{M}$ can then be 
defined as containing all those subsets $S \subseteq [n]$ for which the associated operators are 
pairwise commuting. The resulting outcome distributions then define a marginal model on 
M. As witnessed by the Kochen-Specker theorem [35] and by Bell's theorem [8, 23], this 
marginal model is often contextual. This is the essence of quantum contextuality. We refer 
to [2, 38] and our companion paper [16] for more detail and the explanation for why some 
of these contextual marginal models can be interpreted as quantum nonlocality. The latter 
marginal models — also known as nonlocal correlations — have been found to be a useful 
resource for information processing and communication tasks [10, 43, 44]. 



¹ Since we take our random variables to be discrete, we also assume the operator $A$ to be discrete, i.e. to have 
pure point spectrum. 
(2) Knowledge integration of expert systems: many artificial intelligence systems aggregate 
information from several sources. In general, none of these sources will provide perfect 
information about the state of the world, but only about certain aspects of it; typically, 
the aspects probed by different sources will overlap. The system then faces the problem 
of integrating the given observations into a consistent picture of the state of the world. 
Mathematically, this boils down to finding a probability distribution consistent with a given 
collection of marginals; in this context, a marginal problem asks whether such knowledge 
integration is possible. For more information on a popular algorithm used for finding a 
global joint distribution and an analysis of its behavior in the contextual case, we refer 
to [56]. 

(3) Database theory and privacy aspects of databases: this is best illustrated with an example. 
A health insurance provider typically has an enormous database of patients which contains, 
for each patient, a long list of properties like gender, age, diseases, nationality, clinical 
history, etc. The associated statistics of this data will be of great interest to managers, 
politicians, researchers in medicine and the general public. However, making the complete 
database available would compromise the privacy of the patients and is therefore not an 
option: even after discarding patient names, the database is still likely to contain enough 
information to make some individual entries be uniquely identifiable with certain persons. 
Hence there is a balance between the usefulness of the data released and the privacy of the 
individuals in the database. One approach for achieving such a balance lies in releasing 
only certain marginals of the table [1]: for example, the joint distribution of gender, age, 
and heart disease prevalence. Given a collection of such marginals, the question is obvious: 
what do those marginals reveal about the database itself [17]? This is very similar to a 
marginal problem and we expect some of our methods to also apply in this situation. The 
question of contextuality of a marginal model reappears as soon as one also adds random 
components to the marginals before releasing them in order to further increase privacy [21]. 

(4) Vorob'ev's theory of coalition games: a coalition game features a finite set of players to- 
gether with a collection of coalitions, where each coalition is a subset of the players. A 
player may belong to any number of coalitions. Each player has a finite set of pure strate- 
gies representing his possible actions. A mixed strategy is a probability distribution over 
the set of pure strategies. The strategies chosen by the players do not have to be inde- 
pendent, so that the global strategy of all players is specified by a joint distribution over 
strategy assignments. Roughly speaking, each coalition specifies a joint mixed strategy for 
its players. The question then is whether there is a global mixed strategy marginalizing to 
those specified by the coalitions. This is a marginal problem. 

We note that the standard notion of "coalition game" is not Vorob'ev's, but rather refers 
to cooperative game theory [49]. 

3. The entropy cone and polymatroids 

Surprisingly, in some cases the contextuality of a marginal model can be detected already by only 
looking at the Shannon entropies of the given marginal distributions. To our knowledge, this was 
first noticed by Braunstein and Caves [9] in the case of marginal models arising from quantum 
nonlocality. Before getting to these ideas, we begin by recalling some properties of Shannon entropy. 

The entropy cone. Let $A_1, \ldots, A_n$ be random variables with a certain joint distribution. We do 
not explicitly specify the codomain of these variables, which can be any set; however, we always 
assume it to be finite or countable in such a way that all the entropies which we consider in the 
following are well-defined and finite. We write $P(a_1, \ldots, a_n)$, or simply $P$, for their joint distribution 
over outcome tuples $(a_1, \ldots, a_n)$. 

For any subset of indices $S \subseteq [n]$, we consider the joint Shannon entropy associated to the 
marginal distribution $P_{|S}$ of $A_S$, 

\[
H(A_S) = -\sum_{a_S} P_{|S}(a_S) \log P_{|S}(a_S).
\]

As a degenerate case, the distribution $P_{|\emptyset}$ is the unique probability distribution on one outcome, 
and hence its entropy is given by $H(A_\emptyset) = 0$. 

The vector $(H(A_\emptyset), \ldots, H(A_{[n]}))$, where the components range over all of the $H(A_S)$ for $S \subseteq [n]$, 
is a point in $\mathbb{R}^{2^{[n]}}$. The collection of all points in $\mathbb{R}^{2^{[n]}}$ which arise from probability distributions in 
this way is difficult to characterize. Its closure is known to be a convex cone [62, Thm. 15.5], the 
(closed) entropy cone $\overline{\Gamma}^*_n$ [61, 62, 64]. Since any closed convex cone can be described in terms of 
the linear inequalities which bound it, one may now ask: what is the description of $\overline{\Gamma}^*_n$ in terms of 
linear inequalities? 

There are some obvious linear constraints satisfied by all points $H(\cdot) \in \overline{\Gamma}^*_n$. For example, 
$H(A_S) \geq 0$ for every $S \subseteq \{1, \ldots, n\}$, and $H(A_\emptyset) = 0$. More generally, every point $H(\cdot) \in \overline{\Gamma}^*_n$ 
satisfies the following basic inequalities [61], 

\begin{align}
0 &\leq H(A_S) \quad \text{(with equality if } S = \emptyset\text{)} \tag{3} \\
H(A_S) &\leq H(A_T) \quad \text{if } S \subseteq T \tag{4} \\
H(A_{S \cap T}) + H(A_{S \cup T}) &\leq H(A_S) + H(A_T) \tag{5}
\end{align}

for every pair of subsets $S, T \subseteq \{1, \ldots, n\}$. As already stated in the introduction, the second 
and third inequalities can be regarded as saying that the conditional entropy $H(A_T|A_S)$ and the 
conditional mutual information $I(A_S : A_T|A_{S \cap T})$ are non-negative. 

Definition 3.1. A linear inequality in the $H(A_S)$'s is a Shannon-type inequality if it is a non- 
negative linear combination of the basic inequalities. 

The collection of basic inequalities would be a complete description of $\overline{\Gamma}^*_n$ if all inequalities valid 
for $\overline{\Gamma}^*_n$ were Shannon-type. However, for $n \geq 4$, this is known not to be the case [64], [62, Thm. 15.7]. 
As far as we know, finding the complete inequality description of $\overline{\Gamma}^*_n$ remains an elusive problem. 

The Shannon-type inequalities are the ones which are most commonly used in information theory. 
We will also focus on Shannon-type inequalities for the most part. 

Polymatroids. We now would like to ask, is it possible to detect the contextuality of a marginal 
model by looking at the entropies of the given marginals and finding that a Shannon-type inequality 
is violated? Clearly, in order for such an inequality to be applicable, it should only depend on those 
$H(A_S)$ for which $S \in \mathcal{M}$, so that the distribution of $A_S$ is given. We would like to talk about 
the collection of Shannon-type inequalities which can be used in this way. This requires us to not 
work with the cone $\overline{\Gamma}^*_n$, but rather with the collection of all vectors in $\mathbb{R}^{2^{[n]}}$ which satisfy the basic 
inequalities. This is the convex cone of polymatroids: 

Definition 3.2 ([22]). A polymatroid is a pair $([n], f)$ where $n \in \mathbb{N}$ and $f$ is the rank function, a 
function $f : 2^{[n]} \to \mathbb{R}$ which satisfies the basic inequalities 

\begin{align}
0 &\leq f(S) \quad \text{(with equality if } S = \emptyset\text{)} \tag{6} \\
f(S) &\leq f(T) \quad \text{if } S \subseteq T \tag{7} \\
f(S \cap T) + f(S \cup T) &\leq f(S) + f(T) \tag{8}
\end{align}

for all $S, T \subseteq [n]$. 

We will usually identify a polymatroid with its rank function. The reason for introducing 
polymatroids lies in the fact that they are defined via the basic inequalities: upon replacing $f(S)$ by 
$H(A_S)$, the linear inequalities satisfied by all polymatroids become precisely the Shannon-type 
entropic inequalities. 

If $A_1, \ldots, A_n$ are discrete random variables with a certain joint distribution, then $f : S \mapsto H(A_S)$ 
is a polymatroid. A polymatroid arising in this way is called entropic. Due to the existence of 
non-Shannon-type inequalities [64], not every polymatroid is entropic. Although there are other 
important classes of polymatroids, like polymatroids associated to hypergraphs [55] or network flows [41] 
with applications to network coding [26], we will always have entropy in mind. Notwithstanding, 
the results of this and the following two sections apply generally. 
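The statement that entropies define a polymatroid can be checked numerically on small examples. The sketch below computes the rank function $S \mapsto H(A_S)$ of a joint distribution and tests the basic inequalities (6), (7) and (8) by brute force over all pairs of subsets. All function names and the example distribution are illustrative assumptions; variables are indexed from 0 for convenience.

```python
from collections import defaultdict
from itertools import combinations
from math import log2

def subsets(n):
    """All subsets of {0, ..., n-1} as frozensets."""
    return [frozenset(c) for k in range(n + 1) for c in combinations(range(n), k)]

def entropy_rank_function(joint, n):
    """Map each subset S to H(A_S) in bits, for joint = {outcome tuple: prob}."""
    f = {}
    for s in subsets(n):
        keep = sorted(s)
        marg = defaultdict(float)
        for outcome, p in joint.items():
            marg[tuple(outcome[i] for i in keep)] += p
        f[s] = -sum(p * log2(p) for p in marg.values() if p > 0)
    return f

def is_polymatroid(f, n, tol=1e-9):
    """Check the basic inequalities (6), (7), (8) for a function f on all subsets."""
    all_sets = subsets(n)
    if abs(f[frozenset()]) > tol or any(f[s] < -tol for s in all_sets):
        return False  # (6) fails
    for s in all_sets:
        for t in all_sets:
            if s <= t and f[s] > f[t] + tol:
                return False  # monotonicity (7) fails
            if f[s & t] + f[s | t] > f[s] + f[t] + tol:
                return False  # submodularity (8) fails
    return True

# Hypothetical example: two independent fair bits and their XOR.
xor_joint = {(x, y, x ^ y): 0.25 for x in (0, 1) for y in (0, 1)}
xor_rank = entropy_rank_function(xor_joint, 3)
```

A function such as $S \mapsto |S|^2$ fails the submodularity test, so it is not a polymatroid and in particular cannot arise from entropies.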

The submodularity inequality (8) is a natural convexity-like condition which can be interpreted 
as follows. One may think of $[n]$ as a set of possible tasks which can be completed, and of $f(S)$ for 
$S \subseteq [n]$ as the amount of resources that have to be spent (for example, work) in order to complete 
all tasks in $S$. Completing the tasks $S \cup T$ is at most as difficult as completing $S$ plus completing 
$T$, so that $f(S \cup T) \leq f(S) + f(T)$; since having completed a task $i \in S$ may help in completing 
another task $j \in T$, this inequality will in general be strict. Similar considerations explain (8), if 
one applies this argument to the additional cost relative to the tasks $S \cap T$: if the tasks $S \cap T$ are 
already all done, then the additional cost to complete $S \cup T$ should be less than or equal to the 
additional cost to complete $S$ plus the additional cost to complete $T$. This suggests 

\[
f(S \cup T) - f(S \cap T) \leq [f(S) - f(S \cap T)] + [f(T) - f(S \cap T)],
\]

which is (8). 
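For completeness, the rearrangement that turns the "additional cost" inequality into the form (8) is spelled out here; it uses nothing beyond collecting the $f(S \cap T)$ terms.

```latex
\begin{align*}
f(S \cup T) - f(S \cap T) &\leq [f(S) - f(S \cap T)] + [f(T) - f(S \cap T)] \\
\iff\quad f(S \cup T) - f(S \cap T) &\leq f(S) + f(T) - 2 f(S \cap T) \\
\iff\quad f(S \cap T) + f(S \cup T) &\leq f(S) + f(T).
\end{align*}
```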

Since the defining inequalities are linear, the sum of two polymatroids is again a polymatroid; 
similarly, a positive scalar multiple of a polymatroid is again a polymatroid. Therefore, the set of 
all polymatroids on $[n]$ is a convex cone denoted by $\Gamma_n \subseteq \mathbb{R}^{2^{[n]}}$. As already noted, we have the 
inclusion $\overline{\Gamma}^*_n \subseteq \Gamma_n$, which is strict for $n \geq 4$. 

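Proposition 3.3. All basic inequalities follow from the following ones: 

\begin{align}
f([n] \setminus \{i\}) &\leq f([n]) && \forall i \in [n], \notag \\
f(R) + f(R \cup \{i,j\}) &\leq f(R \cup \{i\}) + f(R \cup \{j\}) && \forall R \subseteq [n],\ i,j \in [n] \setminus R \text{ with } i \neq j, \tag{9} \\
f(\emptyset) &= 0. \notag
\end{align}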

Proof. This result is well-known [62, Sec. 14]. □ 
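The reduced list of Proposition 3.3 is small enough to enumerate explicitly for small $n$. The sketch below generates it (leaving out the normalization $f(\emptyset) = 0$) and tests a given function against it. The representation of an inequality as a pair of lists of subsets, and all names, are illustrative assumptions; since (9) is symmetric in $i$ and $j$, each unordered pair is listed once.

```python
from itertools import combinations

def elemental_inequalities(n):
    """Inequalities of Proposition 3.3 on [n] = {1, ..., n}, without f(empty set) = 0.
    Each entry (lhs, rhs) is a pair of lists of frozensets and stands for
    sum of f over lhs <= sum of f over rhs."""
    ground = frozenset(range(1, n + 1))
    ineqs = [([ground - {i}], [ground]) for i in sorted(ground)]
    for i, j in combinations(sorted(ground), 2):
        rest = sorted(ground - {i, j})
        for k in range(len(rest) + 1):
            for r in combinations(rest, k):
                R = frozenset(r)
                ineqs.append(([R, R | {i, j}], [R | {i}, R | {j}]))
    return ineqs

def satisfies(f, ineqs, tol=1e-9):
    """True if the function f (a dict on frozensets) satisfies every listed inequality."""
    return all(sum(f[s] for s in lhs) <= sum(f[s] for s in rhs) + tol for lhs, rhs in ineqs)

n = 3
ineqs = elemental_inequalities(n)
# Rank function of n perfectly correlated fair bits: one bit of entropy for every non-empty subset.
correlated_rank = {frozenset(c): min(len(c), 1)
                   for k in range(n + 1) for c in combinations(range(1, n + 1), k)}
```

The enumeration produces $n$ monotonicity conditions and one submodularity condition for each unordered pair $\{i, j\}$ and each $R \subseteq [n] \setminus \{i, j\}$, which is far fewer than testing (7) and (8) on all pairs of subsets.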

Marginal problems for polymatroids. We now define a version of marginal problems which is 
not about random variables, but about polymatroids. By taking entropies, a marginal problem for 
random variables can be mapped into a marginal problem for a polymatroid, such that contextuality 
of the latter implies contextuality of the former (but not conversely, in general); see figure 1. The 
following definition introduces the polymatroid analog of a marginal model: 



[Figure 1 (diagram): joint distribution $P$ --(take entropies)--> polymatroid $f$; joint distribution $P$ --(take marginals)--> marginal model $P^{\mathcal{M}}$; polymatroid $f$ --(restrict to $\mathcal{M}$)--> partial polymatroid $f^{\mathcal{M}}$; marginal model $P^{\mathcal{M}}$ --(take entropies)--> partial polymatroid $f^{\mathcal{M}}$.] 

Figure 1. Relation between the different concepts discussed in the main text. By 
definition, a marginal model (resp. partial polymatroid) is non-contextual if and 
only if it arises from a vertical arrow. 

Definition 3.4. A partial polymatroid on a marginal scenario $\mathcal{M}$ is a function 

\[
f^{\mathcal{M}} : \mathcal{M} \to \mathbb{R}
\]

which satisfies (6), (7) and (8) for all $S, T \in \mathcal{M}$ for which $S \cup T \in \mathcal{M}$. 

Intuitively, requiring the inequalities on $S, T \in \mathcal{M}$ with $S \cup T \in \mathcal{M}$ only is analogous to the 
compatibility condition in definition 2.2. 

The most obvious example of a partial polymatroid is the restriction $f_{|\mathcal{M}}$ of a polymatroid 
$f : 2^{[n]} \to \mathbb{R}$ to $f_{|\mathcal{M}} : \mathcal{M} \to \mathbb{R}$. However, we will see soon that not all partial polymatroids arise in 
this way. These include some which come from marginal models: 

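Proposition 3.5. Let $\mathcal{M}$ be a marginal scenario and $P^{\mathcal{M}}$ a marginal model on $\mathcal{M}$ for variables 
$(A_i)_{i \in [n]}$. Then the function 

\[
f^{\mathcal{M}} : \mathcal{M} \to \mathbb{R}, \quad S \mapsto H(A_S)
\]

is a partial polymatroid on $\mathcal{M}$. 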

Proof. It is straightforward to check that this satisfies definition 3.4 by using the basic inequali- 
ties (6), (7), (8) in combination with the assumption (1). □ 

Definition 3.6. A partial polymatroid $f^{\mathcal{M}}$ is non-contextual if there is a polymatroid $f$ such that 
$f^{\mathcal{M}}(S) = f(S)$ for all $S \in \mathcal{M}$. Otherwise, $f^{\mathcal{M}}$ is contextual. 

If a marginal model is non-contextual, then the associated partial polymatroid is clearly non-contextual,
too. Hence, showing the contextuality of the associated partial polymatroid is one way to detect the
contextuality of a marginal model: it gives a sufficient, but in general not necessary, criterion for
contextuality of the marginal model. In analogy with problem 2.5, we therefore consider:

Problem 3.7 (Marginal problem for polymatroids). Given a partial polymatroid $f^{\mathcal{M}}$ on $\mathcal{M}$, under
which conditions is it non-contextual?

Example 3.8. Going back to example 2.4, we again consider the "triangle" or "three coins" 
marginal scenario 

C_3 = {{1,2}, {2,3}, {1,3}}- . (10)
For any polymatroid $f$, the basic inequalities

$f(\{1,3\}) \le f(\{1,2,3\})$,
$f(\{1,2,3\}) + f(\{2\}) \le f(\{1,2\}) + f(\{2,3\})$ (11)

imply directly

$f(\{1,3\}) + f(\{2\}) \le f(\{1,2\}) + f(\{2,3\})$. (12)

For later use, we refer to this inequality as the triangle inequality. It is an inequality for the values
of $f$ on the subsets in $\mathcal{C}_3$. Hence it can also be evaluated on partial polymatroids over $\mathcal{C}_3$, and
a violation witnesses the partial polymatroid's contextuality. For example, this happens for the
partial polymatroid defined as

$f^{\mathcal{C}_3}(\{1\}) = f^{\mathcal{C}_3}(\{2\}) = f^{\mathcal{C}_3}(\{3\}) = 1, \quad f^{\mathcal{C}_3}(\{1,2\}) = f^{\mathcal{C}_3}(\{2,3\}) = 1, \quad f^{\mathcal{C}_3}(\{1,3\}) = 2.$

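This partial polymatroid can be interpreted as arising from the following marginal model, similar
to (2),

$P_{\{1,2\}}(A_1 = a_1, A_2 = a_2) = \begin{cases} 1/2 & \text{if } a_1 = a_2 \\ 0 & \text{if } a_1 \ne a_2 \end{cases}$
$P_{\{2,3\}}(A_2 = a_2, A_3 = a_3) = \begin{cases} 1/2 & \text{if } a_2 = a_3 \\ 0 & \text{if } a_2 \ne a_3 \end{cases}$ (13)
$P_{\{1,3\}}(A_1 = a_1, A_3 = a_3) = 1/4 \quad \forall a_1, a_3.$

As in (2), $A_1$ and $A_2$ are perfectly correlated, and likewise $A_2$ and $A_3$. But now, $A_1$ and $A_3$ are
completely uncorrelated (instead of anticorrelated, as in (2)).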

With this definition, every single variable has 1 bit of entropy, the joint distribution of $A_1$ and
$A_2$ has 1 bit of entropy, likewise for $A_2$ and $A_3$, and the joint distribution of $A_1$ and $A_3$ has 2 bits
of entropy. This realizes the partial polymatroid $f^{\mathcal{C}_3}$. Since the triangle inequality (12) is violated,
there exists no joint distribution for all three variables marginalizing to the given ones, and (12)
witnesses the contextuality of this marginal model.
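
The entropy values quoted in this example can be reproduced numerically. The following is a minimal sketch, not part of the original computations; the helper names `entropy` and `marginal` and the dictionary `model` are illustrative assumptions. It evaluates both sides of the triangle inequality (12) on the entropies of the marginal model (13).

```python
from math import log2

def entropy(dist):
    """Shannon entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def marginal(dist, pos):
    """Marginal of a two-variable distribution on coordinate `pos` (0 or 1)."""
    out = {}
    for outcome, p in dist.items():
        out[outcome[pos]] = out.get(outcome[pos], 0.0) + p
    return out

coin = ("heads", "tails")
perfect = {(a, b): (0.5 if a == b else 0.0) for a in coin for b in coin}
uniform = {(a, b): 0.25 for a in coin for b in coin}
# Marginal model (13), keyed by the pair of variables (hypothetical encoding).
model = {(1, 2): perfect, (2, 3): perfect, (1, 3): uniform}

H12, H23, H13 = (entropy(model[k]) for k in [(1, 2), (2, 3), (1, 3)])
H2 = entropy(marginal(model[(1, 2)], 1))

# Triangle inequality (12): H(A1 A3) + H(A2) <= H(A1 A2) + H(A2 A3)
lhs, rhs = H13 + H2, H12 + H23
violated = lhs > rhs + 1e-12
```

Here `lhs` collects the left-hand side and `rhs` the right-hand side of (12), so `violated` records whether the model is witnessed as contextual.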

Remark 3.9. As in example 2.4, the reason why the marginal model in this example is contextual
is that perfect correlation is transitive: if $A_1$ is perfectly correlated with $A_2$, and $A_2$ is perfectly
correlated with $A_3$, then $A_1$ should also be perfectly correlated with $A_3$. We regard the triangle
inequality (12) as one quantitative version of this intuition.

However, the entropies associated to the marginal model (2) do not violate (12): on the level of
entropies, there is no difference between (2) and the marginal model in which $A_1$ and $A_3$ are also
perfectly correlated (instead of anti-correlated), and this latter marginal model is obviously
non-contextual. Entropies cannot distinguish between correlation and anti-correlation, and are generally
very coarse invariants of probability distributions. From this point of view, it is quite surprising
that entropic inequalities can witness the contextuality of some marginal models like (13) at all.
See also [16].

4. Computations and applications 

Fourier-Motzkin elimination. Determining whether a given partial polymatroid $f^{\mathcal{M}}$ is
non-contextual means checking whether there exist values $f(S)$ for $S \in 2^{[n]} \setminus \mathcal{M}$ which extend the given
partial polymatroid to a "full" polymatroid $f$. This requires values $f(S)$ such that all the basic
inequalities (9) hold not just on $\mathcal{M}$, but on all of $2^{[n]}$. This is a linear programming problem, as in
Yeung's linear programming framework for Shannon-type entropic inequalities [61], and therefore
can be solved in time polynomial in its size. However, the size of this linear program is $2^n$, which
typically grows exponentially in the size of $\mathcal{M}$. This happens, for example, for the family of n-cycle
scenarios $\mathcal{C}_n$ discussed in section 5, where the number of missing values $f(S)$, i.e. the number of
unknowns of the linear program, is $2^n - 2n - 1$. However, for variables with $d$ outcomes, this
is still significantly smaller than the $d^n$ of remark 2.6.

Footnote to the triangle inequality (12): Note that it indeed has considerable similarity to the ordinary
triangle inequality for a metric, $d(x,z) \le d(x,y) + d(y,z)$.

Given a polymatroid $f : 2^{[n]} \to \mathbb{R}$, one computes the restriction $f|_{\mathcal{M}} : \mathcal{M} \to \mathbb{R}$ by simply
forgetting the values $f(S)$ for all $S \in 2^{[n]} \setminus \mathcal{M}$. Geometrically speaking, this is equivalent to
projecting the point $f \in \mathbb{R}^{2^{[n]}}$ down to $\mathbb{R}^{\mathcal{M}}$ by forgetting some of the coordinates. Therefore, the
set of non-contextual partial polymatroids on $\mathcal{M}$ is a projection of $\Gamma_n$ along a map $\mathbb{R}^{2^{[n]}} \to \mathbb{R}^{\mathcal{M}}$
which throws away some of the coordinates. In particular, this set is also a convex cone, and we
denote it by $\Gamma^{\mathcal{M}}$. If an inequality description of $\Gamma^{\mathcal{M}}$ is known, then deciding the non-contextuality of
a given partial polymatroid is simple: one only needs to check whether it satisfies all the inequalities
defining $\Gamma^{\mathcal{M}}$. Therefore, it is very desirable to compute the inequality description of $\Gamma^{\mathcal{M}}$. For the
cycle scenarios $\mathcal{M} = \mathcal{C}_n$ to be defined in section 5, we will find an analytic solution to this problem.

A natural way to determine such projections $\Gamma^{\mathcal{M}}$ would be to calculate the extremal rays of $\Gamma_n$
and drop the irrelevant coordinates of these. The resulting points in $\mathbb{R}^{\mathcal{M}}$ generate the polyhedral
cone $\Gamma^{\mathcal{M}}$. However, determining all the extremal rays of the cone $\Gamma_n$ is a very hard problem, with
explicit solutions known only for $n \le 5$ [30, 50, 53]. Hence this method is not practical.

A better way of determining F-^ is to start from the inequality description (9) of F„ and then 
apply Fourier-Motzkin elimination. Fourier-Motzkin elimination [59] is a standard method for 
calculating the inequality description of a projection of a polyhedral cone, given its inequality 
description. The correctness of the algorithm represents a proof showing that the projected cone 
is again polyhedral, i.e. also has a description in terms of a finite number of linear inequalities. 
Fourier-Motzkin elimination has been implemented in various computational geometry software 
packages such as PORTA [18]. 
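
To make the elimination step concrete, the following minimal sketch projects the polymatroid cone for $n = 3$ onto the triangle scenario of example 3.8 by eliminating the single coordinate $f(\{1,2,3\})$. It is an illustration only and not the PORTA-based workflow: the function names `fourier_motzkin`, `row` and `violated` are hypothetical, no redundancy removal is attempted, and the code assumes that the part of (9) not displayed above consists of the monotonicity inequalities $f([n] \setminus \{i\}) \le f([n])$.

```python
from itertools import combinations

def fourier_motzkin(rows, k):
    """Project the cone {x : row . x <= 0 for all rows} by eliminating coordinate k."""
    pos = [r for r in rows if r[k] > 0]
    neg = [r for r in rows if r[k] < 0]
    kept = [r for r in rows if r[k] == 0]
    for p in pos:
        for q in neg:
            # Positive combination of p and q in which coordinate k cancels.
            kept.append(tuple(-q[k] * a + p[k] * b for a, b in zip(p, q)))
    reduced = {r[:k] + r[k + 1:] for r in kept}   # drop coordinate k, deduplicate
    return sorted(r for r in reduced if any(r))   # discard trivial rows 0 <= 0

n = 3
ground = frozenset(range(1, n + 1))
# Coordinates: f({1}), f({2}), f({3}), f({1,2}), f({1,3}), f({2,3}), f({1,2,3}).
subsets = [frozenset(c) for r in range(1, n + 1)
           for c in combinations(range(1, n + 1), r)]
index = {s: i for i, s in enumerate(subsets)}

def row(terms):
    """Coefficient vector of sum_k c_k f(S_k) <= 0; f(empty set) = 0 is dropped."""
    r = [0] * len(subsets)
    for s, c in terms:
        if s:
            r[index[frozenset(s)]] += c
    return tuple(r)

rows = []
for i in ground:  # assumed monotonicity part of (9): f([n] \ {i}) <= f([n])
    rows.append(row([(ground - {i}, 1), (ground, -1)]))
for i, j in combinations(sorted(ground), 2):  # submodularity part of (9)
    rest = sorted(ground - {i, j})
    for size in range(len(rest) + 1):
        for R in combinations(rest, size):
            R = frozenset(R)
            rows.append(row([(R, 1), (R | {i, j}, 1), (R | {i}, -1), (R | {j}, -1)]))

projected = fourier_motzkin(rows, index[ground])

def violated(ineqs, point):
    """Inequalities (rows) that the given point fails to satisfy."""
    return [r for r in ineqs if sum(c * x for c, x in zip(r, point)) > 0]

triangle = (0, 1, 0, -1, 1, -1)   # f({1,3}) + f({2}) - f({1,2}) - f({2,3}) <= 0
f_example = (1, 1, 1, 1, 2, 1)    # partial polymatroid of example 3.8
f_correlated = (1, 1, 1, 1, 1, 1) # three perfectly correlated fair coins
```

The list `projected` then contains the row `triangle`, and `violated(projected, f_example)` returns exactly the inequalities that witness contextuality of the example. For larger $n$ the number of generated rows grows quickly, which is one reason dedicated software is preferable in practice.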

Since our objective of calculating the inequality description of $\Gamma^{\mathcal{M}}$ is a problem of precisely this
form, it is straightforward in principle to apply Fourier-Motzkin elimination in order to achieve
this for any given $\mathcal{M}$. For $n \le 5$, we have successfully used the PORTA software in order to do
so for various $\mathcal{M}$. In particular, we have verified the upcoming proposition 5.1 for $n \le 5$. Our
MATHEMATICA program for generating a PORTA input file from the specification of $\mathcal{M}$ is available
online [15].

The contextuality of a marginal model can be detected by Shannon-type entropic inequalities
if and only if the associated partial polymatroid does not lie in $\Gamma^{\mathcal{M}}$. Since Fourier-Motzkin elimination
computes all the facet inequalities of $\Gamma^{\mathcal{M}}$, the resulting entropic inequalities are tight in the following
sense: they detect the contextuality of any marginal model whose contextuality can in principle be
detected by Shannon-type entropic inequalities.

Including (conditional) independence constraints. In certain applications like the one of
the following subsection, or in one of those which we have considered in [16], one has additional
(conditional) independence constraints on the joint distributions $P(a_1, \ldots, a_n)$ constituting the
solution space of a marginal problem. More explicitly, for disjoint subsets $R, S, T \subseteq [n]$, one might
want to allow only those $P$ which satisfy the conditional independence relation that $A_S$ and $A_T$ are
conditionally independent given $A_R$ (where $R$ might be empty), which can be written in entropic
terms as

$I(A_S : A_T | A_R) = H(A_{R \cup S}) + H(A_{R \cup T}) - H(A_{R \cup S \cup T}) - H(A_R) = 0.$ (14)
In general, one can also have several such constraints at the same time. For ease of presentation,
we restrict to one such constraint, but the general case works in exactly the same way.

The set of all entropy assignments $S \mapsto H(A_S)$ for joint distributions satisfying (14) is a face of
the entropy cone $\overline{\Gamma}^*_n$. Again, we approximate $\overline{\Gamma}^*_n$ by the polymatroid cone $\Gamma_n \supseteq \overline{\Gamma}^*_n$, of which the
polymatroid analogue of (14), which is the equation

$f(R \cup S) + f(R \cup T) - f(R \cup S \cup T) - f(R) = 0,$ (15)

also defines a face. As before, the image of this face of $\Gamma_n$ under the projection $\mathbb{R}^{2^{[n]}} \to \mathbb{R}^{\mathcal{M}}$ defines
a cone in $\mathbb{R}^{\mathcal{M}}$, whose inequality description can be computed by Fourier-Motzkin elimination. A
partial polymatroid $f^{\mathcal{M}}$ over $\mathcal{M}$ equals the restriction $f|_{\mathcal{M}}$ of a polymatroid $f$ satisfying (15) if and
only if it lies in this cone in $\mathbb{R}^{\mathcal{M}}$.
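
The entropic form (14) of a conditional independence relation can be checked numerically on a small example. The sketch below is illustrative only; the function names `entropy_of` and `cond_mutual_info` and the binary Markov chain with flip probability 0.1 are assumptions made for this illustration, not part of the original text.

```python
from itertools import product
from math import log2

def entropy_of(joint, S):
    """H(A_S) in bits; `joint` maps outcome tuples to probabilities, S lists coordinates."""
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in S)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def cond_mutual_info(joint, S, T, R=()):
    """I(A_S : A_T | A_R) evaluated through the entropic expression (14)."""
    S, T, R = list(S), list(T), list(R)
    return (entropy_of(joint, R + S) + entropy_of(joint, R + T)
            - entropy_of(joint, R + S + T) - entropy_of(joint, R))

# Hypothetical example: binary Markov chain A1 -> A2 -> A3, each step flips with prob. 0.1.
flip = 0.1
chain = {}
for a1, a2, a3 in product((0, 1), repeat=3):
    step12 = flip if a2 != a1 else 1 - flip
    step23 = flip if a3 != a2 else 1 - flip
    chain[(a1, a2, a3)] = 0.5 * step12 * step23

ci = cond_mutual_info(chain, S=[0], T=[2], R=[1])  # conditional independence: zero up to rounding
mi = cond_mutual_info(chain, S=[0], T=[2])         # unconditional mutual information: positive
```

The value `ci` vanishes because $A_1$ and $A_3$ are conditionally independent given $A_2$ in a Markov chain, whereas `mi` is strictly positive. This illustrates that (14) is a genuine linear constraint on the entropy vector.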

In conclusion, our method allows the determination of a finite list of tight Shannon-type entropic
inequalities for marginal scenarios also in the presence of (conditional) independence constraints.
Entropic inequalities seem especially useful to us in this kind of situation, since a (conditional)
independence constraint is a linear equation on the level of entropies, so that the linear programming
methods and Fourier-Motzkin elimination still apply. On the level of probabilities however, this is
no longer the case, since a (conditional) independence constraint is a quadratic equation, resulting
in a difficult system of linear inequalities subject to quadratic equations. Due to the relative ease
of working on the level of entropies, we see the relevance of our formalism with respect to marginal
problems in particular in situations where the marginal problem comes with additional (conditional)
independence constraints.

Computational results. Computing projections of cones via Fourier-Motzkin elimination is costly.
Using standard Fourier-Motzkin elimination and making use of symmetries to switch between the
facet description and the extremal ray description of a polyhedral cone is practical for cone
dimensions of up to $\approx 40$ for the highly symmetrical cones arising from combinatorial optimization
problems [19]. In our case, the polymatroid cone $\Gamma_n$ has dimension $2^n - 1$, so that we expect $n = 5$
to be the highest number of variables for which one can calculate the facets of any interesting $\Gamma^{\mathcal{M}}$
with current methods. We have described one such successful application, to a marginal problem
with additional independence constraints, in [16], and now turn to another application for which
our computations have unfortunately not terminated.

We also have not been able to complete any attempted calculation of any $\Gamma^{\mathcal{M}}$ for $n \ge 6$ with
those $\mathcal{M}$ in which we were interested. Due to this high computational complexity, analytical results
like proposition 5.1 are highly relevant also for practical computations using our approach.

Example application: inference of common ancestors in Bayesian networks. This sub- 
section is based on [51], where Steudel and Ay derive entropic inequalities for a certain kind of 
causal inference. We outline now how our systematic approach to entropic inequalities could in 
principle extend their results. 

A Bayesian network is a mathematical model for the causal dependencies between random
variables. We restrict to a brief discussion and refer to [37] for more detail. One of the several equivalent
definitions is this:

Definition 4.1. Let $G = (V, E)$ be an acyclic directed graph.

(1) For $v \in V$, the set of descendants is $\mathrm{de}(v) = \{w \in V \mid (v,w) \in E\}$; the set of parents is
$\mathrm{pa}(v) = \{w \in V \mid (w,v) \in E\}$.

(2) A Bayesian network over $G$ consists of a discrete random variable $A_v$ for every $v \in V$,
so that the $A_v$ have a joint distribution which satisfies the local Markov property: $A_v$
is conditionally independent of $A_{V \setminus \mathrm{de}(v)}$ given $A_{\mathrm{pa}(v)}$, or, equivalently, the corresponding
conditional mutual information vanishes,

$I(A_v : A_{V \setminus \mathrm{de}(v)} \mid A_{\mathrm{pa}(v)}) = 0 \quad \forall v \in V.$ (16)

Intuitively, the edges $E$ model the causal dependencies between the variables $A_v$: every $A_v$ can
be regarded as a probabilistic function of $A_{\mathrm{pa}(v)}$, and this function is independent of everything
else, so that no other causal dependencies exist. The simplest nontrivial example of a Bayesian
network is a Markov chain $A_1 \to A_2 \to A_3$.

We now imagine a situation in which only a certain subset of variables $A_i$ for $i \in M \subseteq [n]$ can be
accessed, so that their joint distribution is known; the $A_i$ for $i \notin M$ on the other hand are "hidden
variables" which mediate the correlations between the accessible variables via the topology of the
Bayesian network, but their distribution cannot be determined. The question then is,

Problem 4.2. What can be said about the topology of the network given only the joint distribution
of the accessible variables $(A_i)_{i \in M}$?

Entropic inequalities give necessary conditions for a certain distribution of the $(A_i)_{i \in M}$ to come
from a certain network topology. In our framework, these are the entropic inequalities corresponding
to the marginal scenario $\mathcal{M} = 2^M \subseteq 2^{[n]}$ with additional independence requirements given by the
local Markov property (16). As explained earlier, these can be calculated by the familiar
Fourier-Motzkin elimination algorithm, at least in principle.

The local Markov property (16) implies other conditional independence relations for the given
variables, the global Markov conditions [37]. As conditional independence relations, these are linear
equations for the entropies. As such, they can be used to eliminate many of the joint entropies
$H(A_S)$ for $S \in 2^{[n]}$ before starting the Fourier-Motzkin elimination algorithm. Therefore, for
concrete calculations of entropic inequalities for Bayesian networks, it is useful to include all global
Markov conditions explicitly in order to speed up the computation, although all of these equations
are implied by the local Markov property.

Example 4.3. Consider the network topology displayed in figure 2. In this case, the local Markov 
conditions are the following: 



$I(A_2 : A_4) = 0, \qquad I(A_3 : A_6 \mid A_2 A_4) = 0,$
$I(A_4 : A_6) = 0, \qquad I(A_5 : A_2 \mid A_4 A_6) = 0,$ (17)
$I(A_6 : A_2) = 0, \qquad I(A_1 : A_4 \mid A_6 A_2) = 0.$





As indicated in figure 2, we assume that the variables $A_1$, $A_3$ and $A_5$ are accessible, while $A_2$, $A_4$
and $A_6$ are hidden. The result of [51, Thm. 10] in this very particular special case is that this
network topology implies the inequality

$2H(A_1 A_3 A_5) \ge H(A_1) + H(A_3) + H(A_5).$ (18)

A slightly better condition has been derived in [24], which is

$H(A_1 A_3) + H(A_3 A_5) \ge H(A_1) + H(A_3) + H(A_5).$ (19)

This inequality actually represents a class of three inequalities equivalent to each other under cyclic
permutations of the variables.

We have attempted to use Fourier-Motzkin elimination in order to calculate all (Shannon-type)
entropic inequalities in this scenario and see whether this latter class of inequalities is optimal and
whether there are other entropic conditions besides this one, but unfortunately our computation
has not terminated.

Figure 2. Example Bayesian network modeling the dependency relations between
six random variables. A circle represents an accessible variable, while a square
stands for a hidden variable.

The interpretation of these inequalities is as follows. If some three observed variables $A_1$, $A_3$, $A_5$
arise from some causal structure in which there is no quantity influencing all three of them,
then they can be modelled in terms of a network topology in which at most every pair has a
common ancestor. Since, for each pair, this network of common ancestors is hidden, it can as well
be subsumed into a single common ancestor variable. This gives rise to the network topology of
figure 2. On the other hand, if the three variables are influenced by some common variable, then
their joint distribution cannot arise from a network topology as in figure 2. One way to witness
this is by violations of inequalities (18) or (19); again, such a violation is a sufficient, but not a
necessary condition for this kind of causal inference. The most drastic example of a violation of
these inequalities occurs when all three variables are identically distributed and perfectly correlated.
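
Both situations described above can be evaluated on small examples. The sketch below is an illustration under stated assumptions: the names `entropy_of` and `gaps` are hypothetical, the first distribution is one particular Bayesian network with the topology of figure 2 (hidden fair bits, each observed variable copying its two parents), and the second consists of three perfectly correlated fair bits.

```python
from itertools import product
from math import log2

def entropy_of(samples, S):
    """H(A_S) in bits for a list of equally likely `samples` (tuples); S lists coordinates."""
    counts = {}
    for s in samples:
        key = tuple(s[i] for i in S)
        counts[key] = counts.get(key, 0) + 1
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())

def gaps(samples):
    """Left minus right side of (18) and of (19); coordinates 0, 1, 2 stand for A1, A3, A5."""
    singles = sum(entropy_of(samples, [i]) for i in range(3))
    gap18 = 2 * entropy_of(samples, [0, 1, 2]) - singles
    gap19 = entropy_of(samples, [0, 1]) + entropy_of(samples, [1, 2]) - singles
    return gap18, gap19

# Assumed instance of the figure-2 topology: hidden fair bits a2, a4, a6;
# A1 = (a6, a2), A3 = (a2, a4), A5 = (a4, a6).
network = [((a6, a2), (a2, a4), (a4, a6)) for a2, a4, a6 in product((0, 1), repeat=3)]
# A single common cause: three perfectly correlated fair bits.
common = [(b, b, b) for b in (0, 1)]

network_gaps = gaps(network)
common_gaps = gaps(common)
```

A negative entry in the returned pair signals a violation of the corresponding inequality. In this sketch the network example satisfies both inequalities, while the common-cause example violates both.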



5. Contextuality in the n-cycle marginal scenario

We now consider a family of marginal scenarios generalizing example 3.8. Already Vorob'ev [57]
(see also [38, Sec. III]) has considered the case where the marginal scenario $\mathcal{M}$ is taken to be the
n-cycle

C_n = {{1,2}, ..., {n-1,n}, {n,1}}- . (20)

This generalizes (10). Much more recently, the 5-cycle has also been considered in relation to
quantum contextuality [34]. A complete characterization of (non-)contextuality of marginal models
with binary variables on $\mathcal{C}_n$ has been given in [4].

In order to have somewhat more convenient notation, we regard all $i \in \mathbb{N}$ modulo $n$ as
representatives of the elements of $[n] = \{1, \ldots, n\}$. In particular, $n + 1$ and $1$ stand for the same element of
$[n]$, so that we can write

C_n = {{1,2}, ..., {n,n+1}}-,

which will turn out to be a more useful notation for the proof below.

Proposition 5.1. Let $n \ge 3$. A partial polymatroid $f^{\mathcal{C}_n}$ is non-contextual if and only if the
inequalities

$f^{\mathcal{C}_n}(\{i,i+1\}) + \sum_{j \ne i, i+1} f^{\mathcal{C}_n}(\{j\}) \le \sum_{j \ne i} f^{\mathcal{C}_n}(\{j,j+1\}) \quad \forall i = 1, \ldots, n$ (21)

hold.

Note that all these inequalities are equivalent to each other via cyclic permutations of $[n]$. In
different form, these inequalities have also been derived in [9].

Proof. For ease of notation, we drop the superscript and write $f$ instead of $f^{\mathcal{C}_n}$.

If $f$ is non-contextual, we take it to be the restriction of a full polymatroid, also denoted by $f$.
We now show that (21) then follows from the triangle inequality (12) by induction on $n$. Due to
cyclic symmetry, it is sufficient to prove this in the case $i = n$. For $n = 3$, the induction basis, this
is precisely (12) itself.

For the induction step, we start with the induction assumption

$f(\{1,n-1\}) + \sum_{j=2}^{n-2} f(\{j\}) \le \sum_{j=1}^{n-2} f(\{j,j+1\})$

to which we add the triangle inequality

$f(\{1,n\}) + f(\{n-1\}) \le f(\{1,n-1\}) + f(\{n-1,n\})$

and get by canceling terms

$f(\{1,n\}) + \sum_{j=2}^{n-1} f(\{j\}) \le \sum_{j=1}^{n-1} f(\{j,j+1\}),$

as desired.

Concerning the other implication direction, we start from a partial polymatroid $f$ defined on $\mathcal{C}_n$
satisfying (21) in addition to the basic inequalities

$0 \le f(\{i\}),$ (22)
$f(\{i\}) \le f(\{i,i+1\}), \qquad f(\{i+1\}) \le f(\{i,i+1\}),$ (23)
$f(\{i,i+1\}) \le f(\{i\}) + f(\{i+1\}),$ (24)

and prove that such an $f$ can be extended to a full polymatroid.

The inequalities (21), together with the basic inequalities (22), (23) and (24), define a convex cone
which contains $\Gamma^{\mathcal{C}_n}$. Our goal is to show that these
two cones actually coincide. To this end, it is enough to consider the extremal rays of the former
cone and prove that they are non-contextual as partial polymatroids. Since the inequalities (21)
have integer coefficients, each extremal ray can be represented by a partial polymatroid $f$ assuming
only integer values. Hence it is enough to prove the assertion for integer-valued $f$, which we assume
to be the case from now on.

We now use induction on the "total rank" value $r_f = \sum_i f(\{i\}) + \sum_i f(\{i,i+1\})$ in order to
prove the non-contextuality of $f$. The base case is $r_f = 0$, which is trivial. The induction step
consists in finding a non-zero polymatroid $g$ such that $f' = f - g|_{\mathcal{C}_n}$ is again a partial polymatroid
satisfying the requirements (21), (22), (23) and (24). Since $r_{f'} < r_f$, the induction assumption
applies, and $f' = h|_{\mathcal{C}_n}$ for some polymatroid $h$. Then $f = (h + g)|_{\mathcal{C}_n}$, as desired.

The $g$ we will construct will actually take values in $\{0, 1\}$. In order to show that $f' = f - g$ satisfies
all the desired inequalities, we always proceed as follows: if $g$ saturates a particular inequality
(i.e. satisfies it with equality), then $f'$ will also satisfy it since $f$ does so. If $g$ does not saturate it,
then saturation fails by 1, and in all these cases we will find that $f$ does not saturate the inequality
either. Since $f$ is integer-valued, saturation has to fail by at least 1. Therefore, $f' = f - g$ will also
satisfy the inequality.

To find an appropriate polymatroid g, we distinguish four cases. 

Case 1: $f(\{i,i+1\}) < f(\{i\}) + f(\{i+1\})$ for all $i$. In information-theoretic terms, this says that
there is positive mutual information between each variable $i$ and $i + 1$.

In this case, we define $g$ to be the polymatroid taking on a constant value of 1 on all
non-empty subsets of $[n]$: in particular, $g(\{i\}) = g(\{i,i+1\}) = 1$ for all $i$. Then defining
$f' = f - g$ will work: $f'$ satisfies (22) because the assumption implies $f(\{i\}) \ge 1$ for all $i$;
moreover, since $g$ saturates (21) and (23) for every $i$, the new $f'$ will satisfy these inequalities
just as $f$ itself does. Finally, (24) holds for $f'$ since $f(\{i,i+1\}) \le f(\{i\}) + f(\{i+1\}) - 1$
by assumption, and $g(\{i,i+1\}) = g(\{i\}) + g(\{i+1\}) - 1$, so that

$f'(\{i,i+1\}) = f(\{i,i+1\}) - g(\{i,i+1\})$
$\le f(\{i\}) + f(\{i+1\}) - 1 - g(\{i\}) - g(\{i+1\}) + 1$
$= f'(\{i\}) + f'(\{i+1\}).$

Case 2: There is exactly one $i$ for which $f(\{i,i+1\}) = f(\{i\}) + f(\{i+1\})$. In entropic terms,
there is exactly one pair of neighboring variables which are independent. This implies that
$f(\{j\}) > 0$ for all $j$, since otherwise $f(\{j,j+1\}) = f(\{j\}) + f(\{j+1\})$ would hold for at
least two values of $j$.

We take this $i$ to be $i = n$ without loss of generality, so that

$f(\{1,n\}) = f(\{1\}) + f(\{n\})$ (25)

is assumed. Then we claim that the set of inequalities (21) is equivalent to the single
inequality

$\sum_{i=1}^{n} f(\{i\}) \le \sum_{i=1}^{n-1} f(\{i,i+1\})$ (26)

given that (23) and (24) hold. For if $i \ne n$, we have

$f(\{i,i+1\}) + \sum_{j \ne i, i+1} f(\{j\}) \overset{(24)}{\le} \sum_{j} f(\{j\}) \overset{(25)}{=} f(\{1,n\}) + \sum_{j=2}^{i} f(\{j\}) + \sum_{j=i+1}^{n-1} f(\{j\}) \overset{(23)}{\le} \sum_{j \ne i} f(\{j,j+1\}),$

which is precisely (21). For $i = n$ in turn, (21) coincides with (26) under the
assumption (25).

Now let $m$ be the smallest index value with the property that $f(\{m,m+1\}) > f(\{m+1\})$.
Since by assumption, $f(\{2\}) \le f(\{1,2\}) < f(\{1\}) + f(\{2\})$, we have $f(\{1\}) > 0$.
Therefore, (26) implies $m \le n - 1$.

We define the polymatroid $g$ by taking it to assume the value 1 on any set $S \subseteq [n]$ for
which $S \cap \{1, \ldots, m\} \ne \emptyset$, and 0 otherwise. This gives in particular,

$g(\{j\}) = \begin{cases} 1 & \text{for } j \in \{1, \ldots, m\} \\ 0 & \text{otherwise} \end{cases}, \qquad g(\{j,j+1\}) = \begin{cases} 1 & \text{for } j \in \{n, 1, \ldots, m\} \\ 0 & \text{otherwise} \end{cases}$

This $g$ corresponds to the situation where the variables $1, \ldots, m$ have 1 bit of entropy and
are perfectly correlated, while all others are deterministic and have vanishing entropy.

It needs to be shown that setting $f' = f - g$ defines a partial polymatroid of the same
kind, which means checking whether the inequalities (22), (23), (24) and (26) hold. We know
$f(\{j\}) \ge 1$ for all $j$, so that $f'(\{j\}) \ge 0$. Furthermore, $g$ saturates $g(\{i\}) \le g(\{i,i+1\})$ for
all $i$ except for $i = n$; therefore, $f'$ satisfies $f'(\{i\}) \le f'(\{i,i+1\})$ for all $i \ne n$. Moreover,
since $f(\{1\}) > 0$, we have $f(\{1,n\}) = f(\{1\}) + f(\{n\}) \ge f(\{n\}) + 1$, and so

$f'(\{n\}) = f(\{n\}) - g(\{n\}) \le f(\{1,n\}) - 1 = f'(\{n,1\}).$

A similar distinction of cases shows $f'(\{i+1\}) \le f'(\{i,i+1\})$ for all $i$. That (24) holds for
$f'$ can be verified similarly: $g$ saturates this inequality for all $j \notin \{1, \ldots, m-1\}$, whereas
$f$ does not saturate it for the other values of $j$ by assumption; therefore, $f'$ satisfies it.
Finally, $g$ also saturates (26), so that $f'$ will also satisfy it since $f$ does.
Case 3: Still $f(\{j\}) > 0$ for all $j$, but now there are two or more values of $i$ for which $f(\{i,i+1\}) =
f(\{i\}) + f(\{i+1\})$.

As in the previous case, we take one of these values to be $i = n$. Then the same
observations apply: (21) is equivalent to (26). Moreover, even that inequality is now automatic:
for $k \ne n$ being the smallest value for which also $f(\{k,k+1\}) = f(\{k\}) + f(\{k+1\})$, we
have $k \le n - 1$ by assumption, and therefore

$\sum_{i=1}^{n} f(\{i\}) = \sum_{i=1}^{k-1} f(\{i\}) + f(\{k,k+1\}) + \sum_{i=k+2}^{n} f(\{i\}) \le \sum_{i=1}^{n-1} f(\{i,i+1\}).$

As in the previous case, we define $g$ by setting $g(S)$ to be 1 if $S \cap \{1, \ldots, k\} \ne \emptyset$, while
$g(S) = 0$ otherwise. This means in particular,

$g(\{j\}) = \begin{cases} 1 & \text{for } j \in \{1, \ldots, k\} \\ 0 & \text{otherwise} \end{cases}, \qquad g(\{j,j+1\}) = \begin{cases} 1 & \text{for } j \in \{n, 1, \ldots, k\} \\ 0 & \text{otherwise} \end{cases}$

Defining $f' = f - g$ now gives the desired new partial polymatroid: by the observation of the
previous paragraph, it is enough to check (23) and (24), and then (21) will be automatic.
Checking this can be done as at the end of the previous case.
Case 4: $f(\{j\}) = 0$ for some $j$. Thanks to the cyclic symmetry, we may consider the case $f(\{n\}) = 0$
and $f(\{1\}) > 0$ without loss of generality. Then let $k$ be the smallest index for which
$f(\{k,k+1\}) > f(\{k+1\})$. In particular, $f(\{j\}) > 0$ for all $j \in \{1, \ldots, k\}$. The polymatroid
$g$ can be defined as in the previous case. Then $f' = f - g$ clearly satisfies (22). It also
satisfies (23), since $g$ saturates these except for $g(\{n\}) \le g(\{n,1\})$ and $g(\{k+1\}) \le
g(\{k,k+1\})$, which is fine since $f$ does not saturate them. Similarly for (24), which $f$ does
not saturate for $i \in \{1, \ldots, k-1\}$ thanks to $f(\{i+1\}) = f(\{i,i+1\}) < f(\{i\}) + f(\{i+1\})$
and $g$ saturates for all other values of $i$. Finally, $f'(\{n\}) = 0$ guarantees that $f'$ also
satisfies (21).

This ends the proof. □

Writing the inequalities (21) in terms of entropies of random variables reads

$H(A_i A_{i+1}) + \sum_{j \ne i, i+1} H(A_j) \le \sum_{j \ne i} H(A_j A_{j+1}) \quad \forall i = 1, \ldots, n$ (27)

This inequality can be applied to marginal problems on $\mathcal{C}_n$:

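Proposition 5.2. Let $n \ge 3$. There is a marginal model on $\mathcal{C}_n$ whose contextuality gets detected
by (27).

Proof. An example is given by a generalization of (13) to all $n$, again with $a_i \in \{\text{heads}, \text{tails}\}$,

$P_{\{i,i+1\}}(A_i = a_i, A_{i+1} = a_{i+1}) = \begin{cases} 1/2 & \text{if } a_i = a_{i+1} \\ 0 & \text{if } a_i \ne a_{i+1} \end{cases} \quad \forall i = 1, \ldots, n-1,$

$P_{\{1,n\}}(A_1 = a_1, A_n = a_n) = 1/4 \quad \forall a_1, a_n.$

The entropies associated to this marginal model violate (27). □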
See [16] for more examples, including many arising from quantum theory. 
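
The criterion of proposition 5.1 is easy to evaluate on given values. The following sketch is an illustration only; the function names `cycle_gaps` and `example_model` and the 0-based list encoding are assumptions. It evaluates all $n$ inequalities (21) and applies them to the entropies of the marginal model from the proof of proposition 5.2.

```python
def cycle_gaps(single, pair):
    """Right minus left side of inequality (21) for each i.

    single[i] holds f({i+1}) and pair[i] holds f({i+1, i+2}), indices taken modulo n
    (0-based lists, so pair[n-1] is the value on {n, 1}).
    A negative entry means that the corresponding inequality is violated.
    """
    n = len(single)
    result = []
    for i in range(n):
        lhs = pair[i] + sum(single[j] for j in range(n) if j not in (i, (i + 1) % n))
        rhs = sum(pair[j] for j in range(n) if j != i)
        result.append(rhs - lhs)
    return result

def example_model(n):
    """Entropies (in bits) of the marginal model in the proof of proposition 5.2."""
    single = [1] * n             # every coin is fair
    pair = [1] * (n - 1) + [2]   # neighbours perfectly correlated, A_1 and A_n independent
    return single, pair

gaps_5 = cycle_gaps(*example_model(5))
gaps_correlated = cycle_gaps([1] * 5, [1] * 5)  # all five coins perfectly correlated
```

For the example model only the inequality with $i = n$ fails, by exactly one bit, while the perfectly correlated values saturate every inequality.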

Given an integer-valued partial polymatroid $f$ satisfying (21), the polymatroids $g$ used in
the proof of proposition 5.1 are actually entropic, so that $f$ turns out to be a sum of entropic
polymatroids, and therefore is itself an entropic polymatroid. Hence we have also proven that
non-Shannon type entropic inequalities cannot be relevant for marginal problems on $\mathcal{C}_n$. In other
words:

Corollary 5.3. Let $n \ge 3$. Every entropic inequality containing only terms $H(A_j)$ and $H(A_j A_{j+1})$,
$j \in [n]$, is Shannon-type.

Application to correlations in stochastic processes. In terms of mutual information, the
inequality (27) in the case $i = n$ can be rewritten as

$I(A_1 : A_n) \ge \sum_{i=1}^{n-1} I(A_i : A_{i+1}) - \sum_{i=2}^{n-1} H(A_i).$ (28)

This can be interpreted as a lower bound on the correlation between $A_1$ and $A_n$, given that there
are certain correlations between each $A_i$ and $A_{i+1}$ for $i = 1, \ldots, n-1$. This in particular suggests an
application to stochastic processes, an idea which we briefly explore now. Let $(A_i)_{i \in \mathbb{Z}}$ be a stationary
stochastic process. Stationarity implies that $H(A_j) = H(A_1)$ and $I(A_j : A_{j+1}) = I(A_1 : A_2)$ for all
$j \in \mathbb{Z}$. Using this, the inequality can also be written as

$I(A_1 : A_n) \ge H(A_1) - (n-1) H(A_2 | A_1).$ (29)

We think of this as follows: let $A_1$ be a signal which undergoes $n - 1$ applications of some noise,
which results in noisy signals $A_2, \ldots, A_n$. These noise applications do not have to be independent,
but we require them to not depend on the particular iteration and to not change the distribution
of the signal in order for the resulting stochastic process to be stationary. We would like to know
how well the final signal $A_n$ approximates the original signal $A_1$. Our inequality (29) gives a lower
bound on the quality with which the original signal can be recovered from the final noisy one. The
results of this section also show that this is the best linear inequality between entropies in this
context. Applying the bound only requires the entropy of the signal $A_1$ to be known together with
$H(A_2 | A_1)$, which quantifies the amount of noise added at each timestep. One consequence is that
the noisy signal $A_n$ contains some information about the original signal $A_1$ for all $n < \frac{H(A_1)}{H(A_2 | A_1)} + 1$.
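
As a numerical illustration of the bound (29), consider a fair bit which is flipped with a small probability at each step, so that $H(A_1) = 1$ bit and $H(A_2 | A_1)$ is the binary entropy of the flip probability. This noise model and the function names below are assumptions made for this sketch.

```python
from math import log2

def binary_entropy(p):
    """Binary entropy h(p) in bits, with h(0) = h(1) = 0."""
    return 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)

def correlation_bound(n, h_signal, h_noise):
    """Right-hand side of (29): H(A_1) - (n - 1) H(A_2 | A_1)."""
    return h_signal - (n - 1) * h_noise

def last_informative_step(h_signal, h_noise):
    """Largest n for which the bound (29) is strictly positive (requires h_noise > 0)."""
    n = 1
    while correlation_bound(n + 1, h_signal, h_noise) > 0:
        n += 1
    return n

# Assumed example: fair bit, flipped independently with probability 0.01 at each step.
h_noise = binary_entropy(0.01)
n_max = last_informative_step(1.0, h_noise)
```

The value `n_max` is the largest $n$ with $n < H(A_1)/H(A_2 | A_1) + 1$. Beyond it the bound becomes negative and hence trivial, although the actual mutual information may of course still be positive.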

6. Shannon-type inequalities for differential entropy 

It is an appealing feature of entropic inequalities that they apply regardless of the number of
outcomes of each variable. One may think that this makes entropic inequalities also applicable
to random variables having an uncountable range, like random variables described by continuous
probability density functions, if one replaces the discrete entropy $-\sum_i p_i \log p_i$ by the differential
entropy $-\int p(x) \log p(x)\, dx$. Unfortunately, this turns out not to be the case. The problem with this
approach is that, for given continuous random variables $A_1, \ldots, A_n$, the joint differential entropies
do not form a polymatroid. Although submodularity (8) remains valid [62, (10.136)], monotonicity
fails: for example, when $A_1$ and $A_2$ are independent and uniformly distributed on $[0, \varepsilon]$ for some
$\varepsilon > 0$, then the differential entropies are

$h(A_1) = h(A_2) = \log \varepsilon, \qquad h(A_1 A_2) = 2 \log \varepsilon.$

So for $\varepsilon < 1$, we have $h(A_1 A_2) < h(A_1) < 0$. Intuitively, the reason for this is that differential
entropy quantifies the randomness of a distribution relative to the Lebesgue measure. In particular,
this relative entropy can become negative, meaning that $h(A)$ itself can become negative. Similar
considerations apply to differential conditional entropy $h(A_2 | A_1) = h(A_1 A_2) - h(A_1)$.

We have seen in the previous sections that entropic inequalities can detect the contextuality
of marginal models for discrete random variables. Does this also apply to marginal problems
for continuous random variables if one uses differential entropy? This would be interesting, since
marginal problems for continuous variables are an important topic with relevance to applied
statistics [20, 29, 46]. We will show in this section that this is not the case with Shannon-type
inequalities, but give an example in the next section of a non-Shannon-type inequality which can
detect the contextuality of a continuous-variable marginal model.

The only basic inequalities which remain valid for differential entropy are the submodularity
inequalities (8). Therefore, the Shannon-type inequalities for differential entropy are those which
are linear combinations of submodularity inequalities only; these coincide with the balanced entropic
inequalities of Chan [13].

Therefore, instead of using partial polymatroids, we now work with submodular functions $f :
2^{[n]} \to \mathbb{R}$. These are those functions which satisfy (8) and $f(\emptyset) = 0$, but not necessarily (6) or (7).
Similarly, a function $f^{\mathcal{M}} : \mathcal{M} \to \mathbb{R}$ is called submodular if $f^{\mathcal{M}}(\emptyset) = 0$ and $f^{\mathcal{M}}$ satisfies (8) for those
$S, T \in \mathcal{M}$ for which $S \cup T \in \mathcal{M}$.

In this way, the Shannon-type entropic inequalities (for differential entropy) are precisely those
inequalities which hold for all submodular functions.

Proposition 6.1. Let $f^{\mathcal{M}} : \mathcal{M} \to \mathbb{R}$ be submodular. Then there is a submodular function $f :
2^{[n]} \to \mathbb{R}$ such that $f|_{\mathcal{M}} = f^{\mathcal{M}}$.

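Proof. We choose a set $V \subseteq [n]$ such that $V \notin \mathcal{M}$, but such that all proper subsets of $V$ are in $\mathcal{M}$,
and define $\mathcal{M}' = \mathcal{M} \cup \{V\}$. Then

$f^{\mathcal{M}'} : \mathcal{M}' \to \mathbb{R}, \quad U \mapsto \begin{cases} f^{\mathcal{M}}(U) & \text{if } U \in \mathcal{M} \\ \min_{S,T \subsetneq V} \left( f^{\mathcal{M}}(S) + f^{\mathcal{M}}(T) - f^{\mathcal{M}}(S \cap T) \right) & \text{if } U = V \end{cases}$

extends $f^{\mathcal{M}}$ from $\mathcal{M}$ to $\mathcal{M}'$. We claim that $f^{\mathcal{M}'}$ is submodular on $\mathcal{M}'$. It needs to be shown that
for any $S, T \in \mathcal{M}'$ with $S \cup T \in \mathcal{M}'$, the submodularity inequality

$f^{\mathcal{M}'}(S \cup T) + f^{\mathcal{M}'}(S \cap T) \le f^{\mathcal{M}'}(S) + f^{\mathcal{M}'}(T)$

holds. If $S \cup T \in \mathcal{M}$, this is true since $f^{\mathcal{M}'}$ restricts to the submodular $f^{\mathcal{M}}$ on $\mathcal{M}$. If $S \cup T \notin \mathcal{M}$,
then necessarily $S \cup T = V$, and the assertion follows from the definition of $f^{\mathcal{M}'}(V)$.

Repeated application of this extension procedure eventually produces a submodular extension of
$f^{\mathcal{M}}$ to all subsets of $[n]$. □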
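
The extension procedure from this proof can be carried out mechanically. The following minimal sketch assumes that the given function is stored as a dictionary on a family of subsets which is closed under taking subsets and contains the empty set; the function names are hypothetical. Processing the missing sets in order of increasing size guarantees that all proper subsets of the current set $V$ are already in the domain, and the minimum is taken over all pairs of proper subsets of $V$, as in the displayed definition.

```python
from itertools import combinations

def proper_subsets(V):
    """All proper subsets of V as frozensets."""
    items = sorted(V)
    return [frozenset(c) for r in range(len(items)) for c in combinations(items, r)]

def extend_submodular(f, n):
    """Extend f (dict: frozenset -> value, domain closed under subsets) to all subsets of {1..n}."""
    f = dict(f)
    for size in range(1, n + 1):                       # smaller sets first, so that all proper
        for c in combinations(range(1, n + 1), size):  # subsets of V are already defined
            V = frozenset(c)
            if V not in f:
                subs = proper_subsets(V)
                f[V] = min(f[S] + f[T] - f[S & T] for S in subs for T in subs)
    return f

def is_submodular(f, n):
    """Check f(S u T) + f(S n T) <= f(S) + f(T) for all subsets S, T of {1..n}."""
    all_sets = [frozenset(c) for r in range(n + 1) for c in combinations(range(1, n + 1), r)]
    return all(f[S | T] + f[S & T] <= f[S] + f[T] for S in all_sets for T in all_sets)

# The contextual partial polymatroid of example 3.8, viewed as a submodular function on C_3.
fs = frozenset
f_triangle = {fs(): 0, fs({1}): 1, fs({2}): 1, fs({3}): 1,
              fs({1, 2}): 1, fs({2, 3}): 1, fs({1, 3}): 2}
full = extend_submodular(f_triangle, 3)
```

Although `f_triangle` admits no extension to a polymatroid, the function `full` is a submodular extension. The price is that monotonicity fails for the new value on $\{1,2,3\}$, which is allowed for submodular functions.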

This implies that there are no Shannon-type inequalities for differential entropy which would be
able to detect contextuality of marginal models: the differential entropy function $\mathcal{M} \to \mathbb{R}$
associated to a continuous-variable marginal model has an extension to a submodular function
$2^{[n]} \to \mathbb{R}$. In particular, it satisfies all inequalities which hold for submodular functions, and these
are precisely the Shannon-type inequalities for differential entropy.

7. Non-Shannon-type inequalities 

In this section, we enumerate the variables $A_i$ by indices $i \in \{w, x, y, z\}$ instead of 
$i \in \{1, 2, 3, 4\}$, as the latter choice might cause confusion. 

It has been known since 1998 [64] that the inclusion $\overline{\Gamma^*_n} \subseteq \Gamma_n$ is 
strict for $n \geq 4$. The inequality 

$$
\begin{aligned}
& -4H(A_w A_y A_z) - H(A_x A_y A_z) - H(A_w A_x) + 3H(A_w A_y) + 3H(A_w A_z) \\
& \quad + H(A_x A_y) + H(A_x A_z) + 3H(A_y A_z) - H(A_w) - 2H(A_y) - 2H(A_z) \geq 0
\end{aligned}
\qquad (30)
$$

bounds $\overline{\Gamma^*_4}$ but not $\Gamma_4$ [62, Thm. 15.7]. We can consider this inequality 
as an entropic inequality in the following marginal scenario, named after the authors of [64], 

$$
\mathcal{M}_{ZY} = \{\{w,y,z\},\{x,y,z\},\{w,x\}\}. \qquad (31)
$$

$\mathcal{M}_{ZY}$ as a simplicial complex is illustrated in figure 3(a). 

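We now consider the partial polymatroid $f^{ZY}$ depicted in figure 3(b). In formulas, its values are 

$$
\begin{aligned}
& f^{ZY}(\emptyset) = 0, \qquad f^{ZY}(\{w\}) = f^{ZY}(\{x\}) = f^{ZY}(\{y\}) = f^{ZY}(\{z\}) = 2, \\
& f^{ZY}(\{w,x\}) = 4, \qquad f^{ZY}(\{w,y\}) = f^{ZY}(\{w,z\}) = f^{ZY}(\{x,y\}) = f^{ZY}(\{x,z\}) = f^{ZY}(\{y,z\}) = 3, \\
& f^{ZY}(\{w,y,z\}) = f^{ZY}(\{x,y,z\}) = 4.
\end{aligned}
$$

It violates (the polymatroid analogue of) inequality (30). 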
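
The violation can be checked by direct arithmetic. The following Python sketch evaluates the 
left-hand side of (30) on the values listed above; the dictionary names are hypothetical, subsets 
are encoded as strings of indices, and the value 2 for the singleton {z} is an assumption taken to 
match the entropy of A_z in lemma 7.1. 

```python
# Values of the partial polymatroid on the simplicial complex generated by
# {w,y,z}, {x,y,z} and {w,x}; subsets are written as strings of indices.
f_zy = {
    "": 0,
    "w": 2, "x": 2, "y": 2, "z": 2,  # the value for "z" is assumed, see lead-in
    "wx": 4,
    "wy": 3, "wz": 3, "xy": 3, "xz": 3, "yz": 3,
    "wyz": 4, "xyz": 4,
}

# Coefficients c_S of inequality (30), read as: sum over S of c_S * H(A_S) >= 0.
coeffs_30 = {
    "wyz": -4, "xyz": -1, "wx": -1,
    "wy": 3, "wz": 3, "xy": 1, "xz": 1, "yz": 3,
    "w": -1, "y": -2, "z": -2,
}

# Evaluate the left-hand side with f_zy in place of the entropies.
lhs = sum(c * f_zy[s] for s, c in coeffs_30.items())
print(lhs)  # -1, so the inequality "lhs >= 0" is violated
```

Every subset appearing in (30) lies in the marginal scenario (31), which is why the inequality can 
be evaluated on a partial polymatroid at all. 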

Lemma 7.1. $f^{ZY}$ arises from taking entropies of a marginal model in the marginal scenario 
$\mathcal{M}_{ZY}$. 

Proof. We start by defining a joint distribution for $A_w$, $A_y$ and $A_z$. Let 
$(\alpha_w, \alpha_y, \alpha_z, \beta)$ be a list of four independent and uniformly distributed 
bits. Then the definitions 

$$
A_w := (\alpha_w, \beta), \qquad A_y := (\alpha_y, \beta), \qquad A_z := (\alpha_z, \beta)
$$

reproduce the desired entropy values (in bits) for all $A_S$ with $S \subseteq \{w, y, z\}$. 

Analogous definitions with $A_x$ in place of $A_w$ also define a distribution for $A_{\{x,y,z\}}$ 
which reproduces the desired entropy values and restricts to the same marginal distribution of 
$A_{\{y,z\}}$ as the distribution of $A_{\{w,y,z\}}$. The product distribution between $A_w$ and 
$A_x$ defines a distribution of $A_{\{w,x\}}$ having the desired properties. This completes the 
definition of the marginal model. □ 

Figure 3. (a) The marginal scenario (31) as a simplicial complex; the numbers 
are vertex indices. (b) The partial polymatroid $f^{ZY}$; the numbers are the values 
that $f^{ZY}$ assigns to the simplices. 
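
The entropy values used in the proof of lemma 7.1 can be confirmed by brute-force enumeration. The 
sketch below is an illustration only: the helper names `entropy` and `joint_entropy` are 
hypothetical, and the variables are built from four independent uniform bits exactly as in the 
proof, with each A_i consisting of its own bit together with the shared bit. 

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(outcomes):
    """Shannon entropy in bits of the empirical distribution of `outcomes`."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


# The 16 equally likely values of four independent uniform bits:
# one private bit for each of w, y, z, followed by the shared bit.
SAMPLES = list(product([0, 1], repeat=4))
INDEX = {"w": 0, "y": 1, "z": 2}


def joint_entropy(names):
    """Entropy of the tuple (A_i for i in names), where A_i = (private bit, shared bit)."""
    return entropy(tuple((s[INDEX[i]], s[3]) for i in names) for s in SAMPLES)


for names in ["w", "y", "z", "wy", "wz", "yz", "wyz"]:
    print(names, joint_entropy(names))

# In the {w, x} marginal, A_w and A_x are independent and each uniform on 4 values.
h_wx = entropy(product(range(4), repeat=2))
print("wx", h_wx)
```

The same enumeration with x in place of w gives the values for the subsets of {x, y, z}, since the 
construction is symmetric under this exchange. 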



Corollary 7.2. The non-Shannon-type inequality (30) detects the contextuality of some marginal 
models in the marginal scenario $\mathcal{M}_{ZY}$, although no Shannon-type inequality does so. 

Proof. The first part is clear by the lemma since $f^{ZY}$ violates (30). For the second part, we 
have used the methods and software described in section 4 to compute all the facet inequalities of 
$\Gamma^{\mathcal{M}_{ZY}}$. Since the partial polymatroid $f^{ZY}$ has turned out to violate none 
of these 67 inequalities, it is non-contextual as a partial polymatroid. □ 

In particular, this shows that (30) is indeed a non-Shannon-type inequality. 
We now extend corollary 7.2 to the continuous-variable case. The original proof of (30) from [63] 
also works in the continuous-variable case, so that 

$$
\begin{aligned}
& -4h(A_w A_y A_z) - h(A_x A_y A_z) - h(A_w A_x) + 3h(A_w A_y) + 3h(A_w A_z) \\
& \quad + h(A_x A_y) + h(A_x A_z) + 3h(A_y A_z) - h(A_w) - 2h(A_y) - 2h(A_z) \geq 0
\end{aligned}
\qquad (32)
$$

is a valid non-Shannon-type inequality for differential entropy. Alternatively, this inequality can 
also be deduced from the results of Chan [13, Thm. 2] on the relation between entropic inequalities 
for discrete and continuous variables. 

We now claim that the partial polymatroid $f^{ZY}$ can also be realized as the collection of 
entropies of a continuous-variable marginal model. In order to do so, the bits in the proof of 
lemma 7.1 should be replaced by independent copies of a continuous variable with uniform 
distribution on $[0, 2]$ (which has a differential entropy of 1 bit). This yields the desired 
continuous-variable marginal model on $\mathcal{M}_{ZY}$. Thanks to proposition 6.1, we know that no 
Shannon-type inequality for differential entropy can detect its contextuality, although (32) does. 
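
For completeness, the differential entropy of the uniform distribution on $[0, 2]$ can be worked 
out explicitly. The following is a short calculation with logarithms taken to base 2; the symbols 
$U$ and $U_1, \ldots, U_k$ are introduced here only for illustration. 

```latex
% Density of U uniform on [0,2]: p(u) = 1/2 for 0 <= u <= 2, and 0 otherwise.
\begin{align*}
h(U) &= -\int_0^2 \tfrac{1}{2} \log_2 \tfrac{1}{2} \, du
      = \tfrac{1}{2} \cdot 2 \cdot \log_2 2 = 1 , \\
% Differential entropy is additive over independent variables:
h(U_1, \ldots, U_k) &= \sum_{i=1}^{k} h(U_i) = k
  \qquad \text{for independent copies } U_1, \ldots, U_k \text{ of } U .
\end{align*}
```

A tuple of $k$ independent copies thus has the same entropy value as $k$ independent uniform bits, 
which is the property the replacement of bits by continuous variables relies on. 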

Remark 7.3. The Fourier-Motzkin elimination approach of section 4 can easily be amended so as 
to deal with some non-Shannon-type inequalities, too. Instead of only using the basic inequalities 
as the initial input to the Fourier-Motzkin solver, one can additionally provide a finite list of non- 
Shannon-type inequalities to begin with, and the Fourier-Motzkin solver will then also take those 
into account while deriving entropic inequalities applicable to a marginal scenario. However, for 
the practical computations that we have done [16], this has not improved the results. 

8. Open problems 

We would now like to mention some relevant questions, which are, to the best of our knowledge, 
still open. 

Partial polymatroids: 

1. Vorob'ev [57] has found a complete characterization of those marginal scenarios in which 
all marginal models are non-contextual. The analogous question in the polymatroid case 
is this: how can one characterize the class of all marginal scenarios in which all partial 
polymatroids are non-contextual? Is the answer the same as in [57]? 

2. Proposition 5.1 implies that for the marginal scenarios $C_n$, recognizing (non-)contextuality 
of a partial polymatroid can be done in polynomial time. What about the complexity of 
this problem for other families of marginal scenarios? Is the general case even in NP? 

3. Under which conditions on $\mathcal{M}$ can every non-contextual polymatroid be realized as an 
entropic polymatroid? In other words, for which $\mathcal{M}$ can one show that all entropic 
inequalities are Shannon-type, like in corollary 5.3? For example, does this also hold when 
$\mathcal{M}$ is a complete graph instead of a cycle graph? If so, then this would mean that there 
are no non-Shannon-type entropic inequalities in which each term is of the form $H(A_i)$ or 
$H(A_i A_j)$. 

Entropic inequalities and marginal problems: 

4. We have seen that taking entropies can turn contextual marginal models into contextual 
partial polymatroids. Which contextual polymatroids can arise in this way? 

5. Upon fixing a finite set of possible outcomes for each variable, the marginal models in a 
given marginal scenario form a convex polytope. All the examples we have found so far [16] 
have the property that taking entropies of an extreme point of this polytope maps it to a 
non-contextual partial polymatroid. Is this always the case? 

6. When writing the entropic inequalities (27) in terms of mutual information, the resulting 
inequalities 

$$
\sum_{i \neq j} I(A_i : A_{i+1}) - I(A_j : A_{j+1}) \leq \sum_{i \neq j, j+1} H(A_i)
$$

bear a great similarity to the inequalities derived in [3], which detect contextuality in the 
n-cycle scenario for binary random variables on the level of probabilities rather than entropies. 
Does this similarity extend to other marginal scenarios? If so, this would provide an 
interesting alternative to the computationally costly Fourier-Motzkin elimination for generating 
Shannon-type entropic inequalities detecting the contextuality of partial polymatroids. 